Fix for unexpected socket closures and data leakage under heavy load #646
+ sockfd = sock.detach()
  # libuv will make socket non-blocking
- tr._open(sock.fileno())
+ tr._open(sockfd)
The approach looks correct -- but I'm wondering how vanilla asyncio handles the same thing?
I think vanilla asyncio has an easier problem: it can use Python sockets "all the way down" and just let reference counting take care of cleanup, while here we need to manage the mismatch with libuv, which deals in file descriptors. I suspect there is some error-handling path where a file descriptor is closed while the Python socket object remains alive and not detached, so when that object is finally closed, it clobbers any new socket that happens to have the same file descriptor.
E.g. create socket s, call a loop method passing in an explicit socket, <bad error path which will end with sock.close()> overlapping with an .accept. I think the .accept never results in a Python socket object being created.
So between the methods that accept sockets and the methods that internally work directly with file descriptors, can there be a discrepancy?
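The hazard described above can be illustrated with plain stdlib sockets. This is a minimal sketch, not uvloop's code: it assumes a POSIX system, the two function names are hypothetical, and `os.close()` merely stands in for a lower layer (such as libuv) closing the descriptor. The first function shows that after `socket.detach()` a later `close()` on the Python object leaves the descriptor alone. The second shows that without `detach()`, a late `close()` can hit an unrelated descriptor that has reused the same number.

```python
import os
import socket


def close_after_detach_is_harmless():
    """After detach(), the Python object no longer owns the descriptor."""
    a, b = socket.socketpair()
    fd = a.detach()                 # ownership of the descriptor moves to the caller
    detached_fileno = a.fileno()    # reports -1 once detached
    a.close()                       # does not touch fd
    os.write(fd, b"ping")           # fd is still open and usable
    data = b.recv(4)
    os.close(fd)
    b.close()
    return detached_fileno, data


def stale_close_hits_reused_fd():
    """Without detach(), a late close() can hit an unrelated descriptor."""
    a, b = socket.socketpair()
    fd = a.fileno()
    os.close(fd)                    # stand-in for a lower layer closing the descriptor
    r, w = os.pipe()                # the kernel usually hands out the lowest free number
    reused = fd in (r, w)
    try:
        a.close()                   # closes whatever currently has number `fd`
    except OSError:
        pass                        # the number was not reused; nothing was clobbered
    victim_closed = False
    if reused:
        try:
            os.fstat(fd)
        except OSError:
            victim_closed = True    # the pipe end was closed behind its owner's back
    for d in (r, w):
        try:
            os.close(d)
        except OSError:
            pass                    # already closed by the stale a.close()
    b.close()
    return reused, victim_closed
```

Whether the descriptor number is actually reused depends on what else the process has open, so the second function reports `reused` separately rather than assuming it.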
|
@todddialpad very nice! Do you know whether it helps with the other issue, #506, which also seems to be related to incorrect sharing of sockets? Is there any possibility of adding a test here?
I am trying to get a stable test. It is tricky because it is a race condition, if my guess is correct. I think it is a race if TLS negotiation during a call to So if this is the case, I don't think this will fix issue #506 , which could be a similar but different root cause. |
OK, I see. The linked issue was also concerning, as it looked like it was trying to write data into an incorrect socket. We observed that error at around the same times as we observed response data being leaked to incorrect requests. But we don't know whether that issue is actually related to the data leakage or is something else. (These RuntimeErrors don't happen with vanilla asyncio.)
I still haven't been able to isolate a standalone, self-contained test. The test environment in which I generated the same error we see in production involves 2 VMs with significant network latency between them. The first of the VMs is just a web server, the second is a web server that accepts requests, and then makes outgoing client requests (using aiohttp) to the first webserver with TLS and a short timeout (around 1 second). With this setup, I quite reliably get a failure within 250 connections. When I run with this patch applied, I have never had a failure in 20,000 connections. We have also run this in our production environment. When we first encountered this failure, we hit it within 1 hour of using aiohttp >= 3.10. Since running with this patch we have been running for 5 days with no failures. |
|
Is accepting this blocked on the tests that are failing? I don't think those failures are related to this change, as they are also failing for PR #644, which is solely a documentation change. I looked at the test logs and I would guess that a dependency is causing the changed results. Related to this, I notice that in the failing tests, an alpha release of Cython 3.1 is being used (
|
Hello everyone. Am I right that this MR fixes the issue below? "RuntimeError: File descriptor 2877 is used by transport <TCPTransport closed=False reading=True 0x55b8dc9baa90>"
|
Hello everyone :) Like many other users of this library, I would be happy for this fix to be implemented in one of the upcoming releases. |
|
We added a workaround for this issue in aio-libs/aiohttp#10464, but it's causing issues when used with asyncio's SelectorEventLoop (aio-libs/aiohttp#10617), so we will likely be reverting it and waiting for this PR instead.
…10464 fixes #10617 alternative fix is MagicStack/uvloop#646
…10464 (#10656) Reverts #10464 While this change improved the situation for uvloop users, it caused a regression with `SelectorEventLoop` (issue #10617) The alternative fix is MagicStack/uvloop#646 (not merged at the time of this PR) issue #10617 appears to be very similar to python/cpython@d5aeccf If someone can come up with a working reproducer for #10617 we can revisit this. cc @top-oai Minimal implementation that shows on cancellation the socket is cleaned up without the explicit `close` #10617 (comment) so this should be unneeded unless I've missed something (very possible with all the moving parts here) ## Related issue number fixes #10617
…10464 (#10656) Reverts #10464 While this change improved the situation for uvloop users, it caused a regression with `SelectorEventLoop` (issue #10617) The alternative fix is MagicStack/uvloop#646 (not merged at the time of this PR) issue #10617 appears to be very similar to python/cpython@d5aeccf If someone can come up with a working reproducer for #10617 we can revisit this. cc @top-oai Minimal implementation that shows on cancellation the socket is cleaned up without the explicit `close` #10617 (comment) so this should be unneeded unless I've missed something (very possible with all the moving parts here) ## Related issue number fixes #10617 (cherry picked from commit 06db052)
…'s a failure in start_connection() #10464 (#10657) **This is a backport of PR #10656 as merged into master (06db052).** Reverts #10464 While this change improved the situation for uvloop users, it caused a regression with `SelectorEventLoop` (issue #10617) The alternative fix is MagicStack/uvloop#646 (not merged at the time of this PR) issue #10617 appears to be very similar to python/cpython@d5aeccf If someone can come up with a working reproducer for #10617 we can revisit this. cc @top-oai Minimal implementation that shows on cancellation the socket is cleaned up without the explicit `close` #10617 (comment) so this should be unneeded unless I've missed something (very possible with all the moving parts here) ## Related issue number fixes #10617 Co-authored-by: J. Nick Koston <nick@koston.org>
|
Hey! I noticed that |
|
You can pin to a yanked version. |
|
Hi folks, I wonder if there is a fix. Our setup
|
Pinning to |
Just a heads-up: if you're using the default asyncio event loop (typically Ideally, we were hoping this PR would be merged to avoid relying on workarounds in |
|
Can we please prioritize this PR, it seems to be impacting many users |
|
@fantix thanks for having a look at this. I am not sure what to make of the test failures. The failures seem to be all related to Unix transports and subprocess transports. The PR only should affect TCP transports. I'm trying to repro. My dev environment is Ubuntu / py3.12, and the tests that are failing here are passing there. For example: Do you have any ideas on how to proceed? |
|
They are breaking in the debug build, maybe try this: uvloop/.github/workflows/tests.yml Lines 68 to 71 in 96b7ed3
|
Yes, I get the failures with the debug build, good eye. Thanks. I have instrumented the changed code, and in a failing test, the modifications never even run (which makes sense since the test isn't creating any TCP connections). I have built without this patch, and still see the failures with the debug build. So, could an upstream dependency have broken the debug build? |
Since the last successful test run, the following upstream dependencies have changed: I rebuilt using Cython-3.0.12 and the tests passed. Would a manual execution of the tests on the main branch still pass (assuming it will grab Cython 3.1.0)? |
I forked the main branch and tried running the tests. It fails with Cython 3.1.0. I pinned Cython to < 3.1.0 and the tests pass. I included this PR, and with the pinned Cython, all tests pass. So I believe this PR could be merged. I created an issue for Cython 3.1.0 #677 . |
|
Hi checking back on this, any ETA on when it would be merged |
|
Following this PR waiting for the fix |
Uvloop has a bug that is preventing from updating the libraries so it's being deactivated for now. MagicStack/uvloop#646
|
Hi folks any ETA on this |
|
Hi Folks |
|
We stopped using uvloop and didn't really observe any performance impact. It is probably better to stop using it until the issue is fixed, especially as we also observed information being leaked under heavy load. (That issue is hard to reproduce locally.)
Hello, I also have this problem. This problem occurs when calling a third-party interface times out and is in a high-concurrency scenario. Have you solved it? Please advise. |
Uninstall uvloop? Several users have reported that the performance difference is small today, so if it's breaking your application... |
Yes, we uninstalled it and did not observe any change in performance. (We process a non-trivial amount of requests, 30K+ RPS, in highly concurrent servers.)
> Briefly describe what this PR accomplishes and why it's needed.
Our serve ingress keeps running into the error below, related to `uvloop`, under heavy load:
```
File descriptor 97 is used by transport
```
The uvloop team has a [PR](MagicStack/uvloop#646) to fix it, but it seems like no one is working on it. One workaround mentioned in the ([PR](MagicStack/uvloop#646 (comment))) is to just turn off uvloop. We tried it in our env and didn't see any major performance difference. Hence, as part of this PR, we are defining a new env variable for controlling uvloop.
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
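An environment-variable switch of the kind described above could look roughly like the sketch below. It is not the code from the referenced PR: the variable name `APP_USE_UVLOOP` and both helper functions are assumptions made up for illustration. The only uvloop API it relies on is `uvloop.new_event_loop()`, and it falls back to the stock asyncio loop when uvloop is disabled or not installed.

```python
import asyncio
import os


def use_uvloop(env=os.environ):
    """Return True unless the (hypothetical) APP_USE_UVLOOP variable disables uvloop."""
    return env.get("APP_USE_UVLOOP", "1").strip().lower() not in ("0", "false", "no")


def make_event_loop(env=os.environ):
    """Create a uvloop loop when enabled and importable, else a stock asyncio loop."""
    if use_uvloop(env):
        try:
            import uvloop
        except ImportError:
            pass  # uvloop is not installed; fall back to asyncio below
        else:
            return uvloop.new_event_loop()
    return asyncio.new_event_loop()
```

Setting `APP_USE_UVLOOP=0` in the environment would then make `make_event_loop()` return the default asyncio loop without any code change.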
commit b3a8434d35f7af0322e3b766b1a1809bd29c2837
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Nov 13 14:31:31 2025 -0800
[doc] remove python 3.12 in doc building (#58572)
unifying to python 3.10
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 31f904f630809152ceba67c8bf1684c8c9b685ea
Author: Andrew Sy Kim <andrewsy@google.com>
Date: Thu Nov 13 17:27:23 2025 -0500
Add support for RAY_AUTH_MODE=k8s (#58497)
This PR adds initial support for RAY_AUTH_MODE=k8s. In this mode, Ray
will delegate authentication and authorization of Ray access to
Kubernetes TokenReview and SubjectAccessReview APIs.
---------
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
commit ade535a9519c19c25aa50c562d2c27128b3ca356
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Thu Nov 13 14:08:29 2025 -0800
[serve] fix serve dashboard metric name (#58573)
Prometheus auto-append the `_total` suffix to all Counter metrics. Ray
historically has been supported counter metric with and without `_total`
suffix for backward compatibility, but it is now time to drop the
support (2 years since the warning was added).
There is one place in ray serve dashboard that still doesn't use the
`_total` suffix so fix it in this PR.
Test:
- CI
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit 62a33c29d23a5c1fb91a969b9aea3ffe1f8281cc
Author: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Date: Thu Nov 13 13:33:33 2025 -0800
[Serve.LLM] Add avg prompt length metric (#58599)
Add avg prompt length metric
When using uniform prompt length (especially in testing), the P50 and
P90 computations are skewed due to the 1_2_5 buckets used in vLLM.
Average prompt length provides another useful dimension to look at and
validate.
For example, using uniformly ISL=5000, P50 shows 7200 and P90 shows
9400, and avg accurately shows 5000.
<img width="1186" height="466" alt="image"
src="https://github.com/user-attachments/assets/4615c3ca-2e15-4236-97f9-72bc63ef9d1a"
/>
---------
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit 0c4dcb032ce03a771c3b6276fb661cfc6b839c01
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Thu Nov 13 12:42:49 2025 -0800
[release] allowing for py3.13 images (cpu & cu123) in release tests (#58581)
allowing for py3.13 images (cpu & cu123) in release tests
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
commit c3ba35e6cb1ce4030d8d361a921a697af516fbca
Author: Goutam <goutam@anyscale.com>
Date: Thu Nov 13 12:26:10 2025 -0800
[Data] - [1/n] Add Temporal, list, tensor, struct datatype support to RD Datatype (#58225)
As title suggests
> Link related issues: "Fixes #1234", "Closes #1234", or "Related to
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.
Signed-off-by: Goutam <goutam@anyscale.com>
commit af20446c362a8f4d17b9226d944a3242b0acafaf
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Thu Nov 13 12:18:38 2025 -0800
[core] fix get_metric_check_condition tests (#58598)
Fix `get_metric_check_condition` to use `fetch_prometheus_timeseries`,
which is a non-flaky version of `fetch_prometheus`. Update all of test
usage accordingly.
Test:
- CI
---------
Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit f1c613dc386268beec06b6c57c12191218ae7e74
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Thu Nov 13 12:14:04 2025 -0800
[core] add an option to disable otel sdk error logs (#58257)
Currently, Ray metrics and events are exported through a centralized
process called the Dashboard Agent. This process functions as a gRPC
server, receiving data from all other components (GCS, Raylet, workers,
etc.). However, during a node shutdown, the Dashboard Agent may
terminate before the other components, resulting in gRPC errors and
potential loss of metrics and events.
As this issue occurs, the otel sdk logs become very noisy. Add a default
options to disable otel sdk logs to avoid confusion.
Test:
- CI
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit 638933ef4aabe24b5def68d72f21e772e354e853
Author: Abrar Sheikh <abrar@anyscale.com>
Date: Thu Nov 13 11:41:29 2025 -0800
[1/n] [Serve] Refactor replica rank to prepare for node local ranks (#58471)
2. **Extracted generic `RankManager` class** - Created reusable rank
management logic separated from deployment-specific concerns
3. **Introduced `ReplicaRank` schema** - Type-safe rank representation
replacing raw integers
4. **Simplified error handling** - not supporting self healing
5. **Updated tests** - Refactored unit tests to use new API and removed
flag-dependent test cases
**Impact:**
- Cleaner separation of concerns in rank management
- Foundation for future multi-level rank support
Next PR https://github.com/ray-project/ray/pull/58473
---------
Signed-off-by: abrar <abrar@anyscale.com>
commit 5d5113134bce5929ff7504f733bbee44a7de2987
Author: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
Date: Thu Nov 13 11:21:50 2025 -0800
[Core] Refactor reference_counter out of memory store and plasma store (#57590)
As discovered in the [PR to better define the interface for reference
counter](https://github.com/ray-project/ray/pull/57177#pullrequestreview-3312168933),
plasma store provider and memory store both share thin dependencies on
reference counter that can be refactored out. This will reduce
entanglement in our code base and improve maintainability.
The main logic changes are located in
* src/ray/core_worker/store_provider/plasma_store_provider.cc, where
reference counter related logic is refactor into core worker
* src/ray/core_worker/core_worker.cc, where factored out reference
counter logic is resolved
* src/ray/core_worker/store_provider/memory_store/memory_store.cc, where
logic related to reference counter has either been removed due to the
fact that it is tech debt or refactored into caller functions.
<!-- Please give a short summary of the change and the problem this
solves. -->
<!-- For example: "Closes #1234" -->
Microbenchmark:
```
single client get calls (Plasma Store) per second 10592.56 +- 535.86
single client put calls (Plasma Store) per second 4908.72 +- 41.55
multi client put calls (Plasma Store) per second 14260.79 +- 265.48
single client put gigabytes per second 11.92 +- 10.21
single client tasks and get batch per second 8.33 +- 0.19
multi client put gigabytes per second 32.09 +- 1.63
single client get object containing 10k refs per second 13.38 +- 0.13
single client wait 1k refs per second 5.04 +- 0.05
single client tasks sync per second 960.45 +- 15.76
single client tasks async per second 7955.16 +- 195.97
multi client tasks async per second 17724.1 +- 856.8
1:1 actor calls sync per second 2251.22 +- 63.93
1:1 actor calls async per second 9342.91 +- 614.74
1:1 actor calls concurrent per second 6427.29 +- 50.3
1:n actor calls async per second 8221.63 +- 167.83
n:n actor calls async per second 22876.04 +- 436.98
n:n actor calls with arg async per second 3531.21 +- 39.38
1:1 async-actor calls sync per second 1581.31 +- 34.01
1:1 async-actor calls async per second 5651.2 +- 222.21
1:1 async-actor calls with args async per second 3618.34 +- 76.02
1:n async-actor calls async per second 7379.2 +- 144.83
n:n async-actor calls async per second 19768.79 +- 211.95
```
This PR mainly makes logic changes to the `ray.get` call chain. As we
can see from the benchmark above, the single clientget calls performance
matches pre-regression levels.
---------
Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
Co-authored-by: Ibrahim Rabbani <irabbani@anyscale.com>
commit 2352e6b8e1e4488822eb787e6112c18c1964fbe0
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Fri Nov 14 00:49:39 2025 +0530
[Core] Support get-auth-token cli command (#58566)
add support for `ray get-auth-token` cli command + test
---------
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit ea5bc3491a74e2b71f4cb6fdb14787fdcb3314fc
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Fri Nov 14 00:37:23 2025 +0530
[Core] Migrate to HttpOnly cookie-based authentication for enhanced security (#58591)
Migrates Ray dashboard authentication from JavaScript-managed cookies to
server-side HttpOnly cookies to enhance security against XSS attacks.
This addresses code review feedback to improve the authentication
implementation (https://github.com/ray-project/ray/pull/58368)
main changes:
- authentication middleware first looks for `Authorization` header, if
not found it then looks at cookies to look for the auth token
- new `api/authenticate` endpoint for verifying token and setting the
auth token cookie (with `HttpOnly=true`, `SameSite=Strict` and
`secure=true` (when using https))
- removed javascript based cookie manipulation utils and axios
interceptors (were previously responsible for setting cookies)
- cookies are deleted when connecting to a cluster with
`AUTH_MODE=disabled`. connecting to a different ray cluster (with
different auth token) using the same endpoint (eg due to port-forwarding
or local testing) will reshow the popup and ask users to input the right
token.
---------
Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
commit 0905c77db5acd286a6ba84a907c60ad2b15416dd
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Nov 13 10:41:57 2025 -0800
[ci] doc check: remove dependency on `ray_ci` (#58516)
this makes it possible to run on a different python version than the CI
wrapper code.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
commit 0bbd8fd22e0447ec66c12e67afc973e95523451b
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Nov 13 10:35:38 2025 -0800
[ci] mark github.Repository as typechecking (#58582)
so that importing test.py does not always import github
github repo imports jwt, which then imports cryptography and can lead to
issues on windows.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 208970b5b399133a41557db8b16ad6832180e6b7
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Nov 13 10:35:23 2025 -0800
[wheel] stop building python 3.9 wheels on the pipelines (#58587)
also stops building python 3.9 aarch64 images
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 33e855e42baaa1ebf4f3f0a1f96f00e87fdc1d11
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Nov 13 10:32:21 2025 -0800
[serve] run tests in python 3.10 (#58586)
all tests are passing
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 5e8433d3cf8b6bea3366094bb4ecfc6f410dec01
Author: Zac Policzer <zac@anyscale.com>
Date: Thu Nov 13 07:37:52 2025 -0800
[core] Add monitoring in raylet for resouce view (#58382)
We today have very little observability into pubsub. On a raylet one of
the most important states that need to be propagated through the cluster
via pubsub is cluster membership. All raylets should in an eventual BUT
timely fashion agree on the list of available nodes. This metric just
emits a simple counter to keep track of the node count.
More pubsub observability to come.
> Link related issues: "Fixes #1234", "Closes #1234", or "Related to
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.
---------
Signed-off-by: zac <zac@anyscale.com>
Signed-off-by: Zac Policzer <zacattackftw@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit dde70e76e5aa993e9224a2d173a053a35a132ebd
Author: Xinyu Zhang <60529799+xyuzh@users.noreply.github.com>
Date: Wed Nov 12 23:04:37 2025 -0800
[Data] Fix HTTP streaming file download by using `open_input_stream` (#58542)
Fixes HTTP streaming file downloads in Ray Data's download operation.
Some URIs (especially HTTP streams) require `open_input_stream` instead
of `open_input_file`.
- Modified `download_bytes_threaded` in `plan_download_op.py` to try
both `open_input_file` and `open_input_stream` for each URI
- Improved error handling to distinguish between different error types
- Failed downloads now return `None` gracefully instead of crashing
```
import pyarrow as pa
from ray.data.context import DataContext
from ray.data._internal.planner.plan_download_op import download_bytes_threaded
urls = [
"https://static-assets.tesla.com/configurator/compositor?context=design_studio_2?&bkba_opt=1&view=STUD_3QTR&size=600&model=my&options=$APBS,$IPB7,$PPSW,$SC04,$MDLY,$WY19P,$MTY46,$STY5S,$CPF0,$DRRH&crop=1150,647,390,180&",
]
table = pa.table({"url": urls})
ctx = DataContext.get_current()
results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
result_table = results[0]
for i in range(result_table.num_rows):
url = result_table['url'][i].as_py()
bytes_data = result_table['bytes'][i].as_py()
if bytes_data is None:
print(f"Row {i}: FAILED (None) - try-catch worked ✓")
else:
print(f"Row {i}: SUCCESS ({len(bytes_data)} bytes)")
print(f" URL: {url[:60]}...")
print("\n✅ Test passed: Failed downloads return None instead of crashing.")
```
Before the fix:
```
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/default/test_streaming_fallback.py", line 110, in <module>
test_download_expression_with_streaming_fallback()
File "/home/ray/default/test_streaming_fallback.py", line 67, in test_download_expression_with_streaming_fallback
with patch.object(pafs.FileSystem, "open_input_file", mock_open_input_file):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1594, in __enter__
if not self.__exit__(*sys.exc_info()):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1603, in __exit__
setattr(self.target, self.attribute, self.temp_original)
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'
(base) ray@ip-10-0-39-21:~/default$ python test.py
2025-11-11 18:32:23,510 WARNING util.py:1059 -- Caught exception in transforming worker!
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
for result in fn(input_queue_iter):
^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
yield f.read()
^^^^^^^^
File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
Traceback (most recent call last):
File "/home/ray/default/test.py", line 16, in <module>
results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 207, in download_bytes_threaded
uri_bytes = list(
^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1113, in make_async_gen
raise item
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
for result in fn(input_queue_iter):
^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
yield f.read()
^^^^^^^^
File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
```
After the fix:
```
Row 0: SUCCESS (189370 bytes)
URL: https://static-assets.tesla.com/configurator/compositor?cont...
```
Tested with HTTP streaming URLs (e.g., Tesla configurator images) that
previously failed:
- ✅ Successfully downloads HTTP stream files
- ✅ Gracefully handles failed downloads (returns None)
- ✅ Maintains backward compatibility with existing file downloads
---------
Signed-off-by: xyuzh <xinyzng@gmail.com>
Signed-off-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
commit 438d6dcf225b7b03ba75ce9593050971458b94ac
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 22:19:50 2025 -0800
[ci] pin docker client version (#58579)
otherwise, the newer docker client will refuse to communicate with the
docker daemon that is on an older version.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 633bb7b1d57ca58a05e905ee4551ee5f96d71750
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Wed Nov 12 22:08:45 2025 -0800
[deps] adding include_setuptools flag for depset config (#58580)
Adding optional `include_setuptools` flag for depset configuration
If the flag is set on a depset config --unsafe-package setuptools will
not be included for depset compilation
If the flag does not exist (default false) on a depset config
--unsafe-package setuptools will be appended to the default arguments
---------
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
commit 292b977661b1ee9804bc0c6a3d3fbecd2b89ec25
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 20:36:43 2025 -0800
[serve] remove minbuild-serve-py3.9 (#58585)
nothing is using it anymore
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 0cdbe3f24132c69c4d6ce9322f85de767b660135
Author: Ibrahim Rabbani <irabbani@anyscale.com>
Date: Wed Nov 12 18:48:27 2025 -0800
[core] (cgroups) Use /proc/mounts if mount file is missing. (#58577)
Signed-off-by: irabbani <irabbani@anyscale.com>
commit 22fbee343bc5326b2912ee24eb8faa8517ea29ec
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 18:26:25 2025 -0800
[deps] update `requirements_buildkite.txt` (#58574)
as the pydantic version is pinned in `requirements-doc.txt` now.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 7a6e29e96b1fa33ad5ff45e37d6f4da7eadd822a
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 16:38:54 2025 -0800
Revert "[bazel] upgrade bazel python rules to 0.25.0" (#58578)
Reverts ray-project/ray#58535
failing on windows.. :(
commit 2f55d078bb69f39198eccf6293683e17a2e72dc5
Author: Goutam <goutam@anyscale.com>
Date: Wed Nov 12 16:37:24 2025 -0800
[Data] - Iceberg support upsert tables + schema update + overwrite tables (#58270)
- Support upserting iceberg tables for IcebergDatasink
- Update schema on APPEND and UPSERT
- Enable overwriting the entire table
Upgrades to pyicberg 0.10.0 because it now supports upsert and overwrite
functionality. Also for append, the library now handles the transaction
logic implicitly so that burden can be lifted from Ray Data.
> Link related issues: "Fixes #1234", "Closes #1234", or "Related to
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.
---------
Signed-off-by: Goutam <goutam@anyscale.com>
commit d6793ecdbc4e6043cc0b0f19862b4b0c8256bb7f
Author: Joshua Lee <73967497+Sparks0219@users.noreply.github.com>
Date: Wed Nov 12 16:31:26 2025 -0800
[core] Use GetNodeAddressAndLiveness in raylet client pool (#58576)
Using GetNodeAddressAndLiveness in raylet client pool instead of the
bulkier Get, same for AsyncGetAll. Seems like it was already done in
core worker client pool, so just making the same change for raylet
client pool.
Signed-off-by: joshlee <joshlee@anyscale.com>
commit e713b3de319afd437f2de7435f5a2870167fa99a
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 15:01:35 2025 -0800
[doc] set default python env to 3.10 (#58570)
we stop supporting building with python 3.9 now
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 8e4b32e0366a9b32f7dfbd55d5dd5a30fc5c734b
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 15:01:20 2025 -0800
[bazel] rename constraint from hermetic to python_version (#58499)
which is more accurate
also moves python constraint definitions into `bazel/` directory and
registering python 3.10 platform with hermetic toolchain
this allows performing migration from python 3.9 to python 3.10
incrementally
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 0d56f3ef9ae32c5ce8543bb76d9ccde120140623
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Wed Nov 12 14:23:17 2025 -0800
[images][deps] raydepsets base extra depset (#58461)
generating depsets for base extra python requirements
Installing requirements in base extra image
---------
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
commit df65225e4f98bce2b45405b1cf89fb70556e2871
Author: Daniel Shin <88547237+kyuds@users.noreply.github.com>
Date: Thu Nov 13 07:08:15 2025 +0900
[Data] Use Approximate Quantile for RobustScaler Preprocessor (#58371)
Currently Ray Data has a preprocessor called `RobustScaler`. This scales
the data based on given quantiles. Calculating the quantiles involves
sorting the entire dataset by column for each column (C sorts for C
number of columns), which, for a large dataset, will require a lot of
calculations.
** MAJOR EDIT **: had to replace the original `tdigest` with `ddsketch`
as I couldn't actually find well-maintained tdigest libraries for
python. ddsketch is better maintained.
** MAJOR EDIT 2 **: discussed offline to use `ApproximateQuantile`
aggregator
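For context, here is a minimal sketch of the computation being approximated, assuming the usual robust-scaling form `(x - median) / (q_high - q_low)` with the 0.25 and 0.75 quantiles. It uses exact quantiles from the Python standard library purely for illustration; `robust_scale` is a hypothetical helper, not the `ApproximateQuantile` aggregator or the Ray Data implementation.

```python
import statistics


def robust_scale(values):
    """Scale values by the median and interquartile range (exact quantiles)."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q_low, median, q_high = statistics.quantiles(values, n=4, method="inclusive")
    spread = q_high - q_low
    if spread == 0:
        raise ValueError("quantile range is zero; cannot scale")
    return [(v - median) / spread for v in values]


scaled = robust_scale([1, 2, 3, 4, 5])
```

An approximate sketch replaces only the quantile lookup; the scaling step stays the same.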
---------
Signed-off-by: kyuds <kyuseung1016@gmail.com>
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
commit 5e71d58badbfdcfc002826398c3e02469065cc71
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Thu Nov 13 03:33:18 2025 +0530
[Core] support token auth in ray client server (#58557)
support token auth in ray client server by using the existing grpc
interceptors. This pr refactors the code to:
- add/rename sync and async client and server interceptors
- create grpc utils to house grpc channel and server creation logic,
python codebase is updated to use these methods
- separate tests for sync and async interceptors
- make existing authentication integration tests run in RAY_CLIENT
mode
---------
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit a6cc5499e7fa07c0d6cdc7b7cd0b08dfc08073dd
Author: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
Date: Wed Nov 12 13:45:02 2025 -0800
[Core] Move request id creation to worker to address plasma get perf regression (#58390)
This PR addresses the performance regression introduced in the [PR to make
ray.get thread safe](https://github.com/ray-project/ray/pull/57911).
Specifically, the previous PR requires the worker to block and wait for
AsyncGet to return with a reply of the request id needed for correctly
cleaning up get requests. This additional synchronous step causes the
plasma store Get to regress in performance.
This PR moves the request id generation step to the plasma store,
removing the blocking step to fix the perf regression.
- [PR which introduced perf
regression](https://github.com/ray-project/ray/pull/57911)
- [PR which observed the
regression](https://github.com/ray-project/ray/pull/58175)
New performance of the change measured by `ray microbenchmark`.
<img width="485" height="17" alt="image"
src="https://github.com/user-attachments/assets/b96b9676-3735-4e94-9ade-aaeb7514f4d0"
/>
Original performance prior to the change. Here we focus on the
regressing `single client get calls (Plasma Store)` metric, where our
new performance returns us back to the original 10k per second range
compared to the existing sub 5k per second.
<img width="811" height="355" alt="image"
src="https://github.com/user-attachments/assets/d1fecf82-708e-48c4-9879-34c59a5e056c"
/>
---------
Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
commit 9e450e6805824ac825488e1455ac97f93df0bbc3
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 12:36:21 2025 -0800
[doc] symlink the doc dependency lock file (#58520)
and ask people to use that lock file for building docs.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 16c2f5fffbd1d772606de28ac39c0bb7182efdd4
Author: Lehui Liu <lehui@anyscale.com>
Date: Wed Nov 12 12:08:28 2025 -0800
[train] Set JAX_PLATFORMS env var based on ScalingConfig (#57783)
1. JaxTrainer relies on the runtime env var "JAX_PLATFORMS" being set
to initialize jax.distributed:
https://github.com/ray-project/ray/blob/master/python/ray/train/v2/jax/config.py#L38
2. Before this change, users had to both set `use_tpu=True`
in `ray.train.ScalingConfig` and pass `JAX_PLATFORMS=tpu` to be able
to start jax.distributed. `JAX_PLATFORMS` can be a comma-separated string.
3. If users use other jax.distributed libraries like Orbax, this sometimes
leads to a misleading error about distributed initialization.
4. After this change, if the user sets `use_tpu=True`, we automatically set
this env var accordingly.
5. A TPU unit test is not available at this time; we will explore how to
cover it later.
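A minimal sketch of the idea in the list above, assuming a plain dict of environment variables. `with_tpu_platform` is a hypothetical helper, and the choice to put `tpu` first when merging with an existing comma-separated value is an assumption, not necessarily what this change does.

```python
def with_tpu_platform(env: dict, use_tpu: bool) -> dict:
    """Return a copy of env with "tpu" present in JAX_PLATFORMS when use_tpu is set."""
    updated = dict(env)
    if not use_tpu:
        return updated
    # JAX_PLATFORMS may be a comma-separated string; keep any existing entries.
    platforms = [p.strip() for p in updated.get("JAX_PLATFORMS", "").split(",") if p.strip()]
    if "tpu" not in platforms:
        platforms.insert(0, "tpu")  # assumption: list tpu first
    updated["JAX_PLATFORMS"] = ",".join(platforms)
    return updated


empty_env = with_tpu_platform({}, use_tpu=True)
merged_env = with_tpu_platform({"JAX_PLATFORMS": "cpu"}, use_tpu=True)
untouched_env = with_tpu_platform({"JAX_PLATFORMS": "cpu"}, use_tpu=False)
```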
---------
Signed-off-by: Lehui Liu <lehui@anyscale.com>
commit 1ab16e26a0251d3964637c6fe0f2f9a0ae8c6312
Author: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Date: Wed Nov 12 12:04:16 2025 -0800
[Data] Add `Ranker` Interface (#58513)
Creates a ranker interface that will rank the best operator to run next
in `select_operator_to_run`. This code only refactors the existing
code. The ranking value must be something that is comparable.
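A rough sketch of what such an interface could look like. The `Op` dataclass, the `SmallestQueueFirst` ranker, and the "lowest rank wins" convention are all assumptions made for illustration; only the idea of a ranker returning a comparable value comes from the description above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass
class Op:
    """Hypothetical stand-in for a physical operator."""
    name: str
    queued_bytes: int


class Ranker(ABC):
    @abstractmethod
    def rank(self, op: Op) -> Any:
        """Return a comparable value; lower means "run sooner" in this sketch."""


class SmallestQueueFirst(Ranker):
    def rank(self, op: Op) -> int:
        return op.queued_bytes


def select_operator_to_run(ops: Sequence[Op], ranker: Ranker) -> Optional[Op]:
    """Pick the operator with the lowest rank, or None if there are no candidates."""
    if not ops:
        return None
    return min(ops, key=ranker.rank)


chosen = select_operator_to_run(
    [Op("read", 300), Op("map", 100), Op("write", 200)], SmallestQueueFirst()
)
```

Because the rank only needs to be comparable, a ranker could also return a tuple to break ties.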
---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
commit 9d5a2416e2980501ffc5c094ce5c59709f93ccf2
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 11:50:42 2025 -0800
[bazel] upgrade bazel python rules to 0.25.0 (#58535)
previously it was actually using 0.4.0, which is set up by the grpc
repo. the declaration in the workspace file was being shadowed..
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 02afe68937429bfd6501e4d0f46780bca4dea329
Author: Balaji Veeramani <balaji@anyscale.com>
Date: Wed Nov 12 11:34:59 2025 -0800
[Data] Refactor concurrency validation tests in `test_map.py` (#58549)
The original `test_concurrency` function combined multiple test
scenarios into a single test with complex control flow and expensive Ray
cluster initialization. This refactoring extracts the parameter
validation tests into focused, independent tests that are faster,
clearer, and easier to maintain.
Additionally, the original test included "validation" cases that tested
valid concurrency parameters but didn't actually verify that concurrency
was being limited correctly—they only checked that the output was
correct, which isn't useful for validating the concurrency feature
itself.
**Key improvements:**
- Split validation tests into `test_invalid_func_concurrency_raises` and
`test_invalid_class_concurrency_raises`
- Use parametrized tests for different invalid concurrency values
- Switch from `shutdown_only` with explicit `ray.init()` to
`ray_start_regular_shared` to eliminate cluster initialization overhead
- Minimize test data from 10 blocks to 1 element since we're only
validating parameter errors
- Remove non-validation tests that didn't verify concurrency behavior
The validation tests now execute significantly faster and provide
clearer failure messages. Each test has a single, well-defined purpose
making maintenance and debugging easier.
---------
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
commit 676b86f4a8d6a4c4eab70f5f381642d9a17fdca2
Author: Balaji Veeramani <balaji@anyscale.com>
Date: Wed Nov 12 11:32:48 2025 -0800
[Data] Convert rST-style to Google-style docstrings in `ray.data` (#58523)
This PR improves documentation consistency in the `python/ray/data`
module by converting all remaining rST-style docstrings (`:param:`,
`:return:`, etc.) to Google-style format (`Args:`, `Returns:`, etc.).
**Files modified:**
- `python/ray/data/preprocessors/utils.py` - Converted
`StatComputationPlan.add_callable_stat()`
- `python/ray/data/preprocessors/encoder.py` - Converted
`unique_post_fn()`
- `python/ray/data/block.py` - Converted `BlockColumnAccessor.hash()`
and `BlockColumnAccessor.is_composed_of_lists()`
- `python/ray/data/_internal/datasource/delta_sharing_datasource.py` -
Converted `DeltaSharingDatasource.setup_delta_sharing_connections()`
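To illustrate the kind of conversion described above, here is the same toy function documented in both styles. `scale_rst` and `scale_google` are made-up examples, not functions from `ray.data`.

```python
def scale_rst(value: float, factor: float) -> float:
    """Scale a value.

    :param value: The number to scale.
    :param factor: The multiplier to apply.
    :return: The scaled number.
    """
    return value * factor


def scale_google(value: float, factor: float) -> float:
    """Scale a value.

    Args:
        value: The number to scale.
        factor: The multiplier to apply.

    Returns:
        The scaled number.
    """
    return value * factor
```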
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
commit 7e872837e450411e9da45acea0c52f4b67221500
Author: Nikhil G <nrghosh@users.noreply.github.com>
Date: Wed Nov 12 09:07:32 2025 -0800
[serve][llm] Fix ReplicaContext serialization error in DPRankAssigner (#58504)
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
commit cd09d104f6d595a805fd8f9979d9f81a828823b5
Author: Alexey Kudinkin <ak@anyscale.com>
Date: Wed Nov 12 11:50:05 2025 -0500
[Data] Lowering `DEFAULT_ACTOR_MAX_TASKS_IN_FLIGHT_TO_MAX_CONCURRENCY_FACTOR` to 2 (#58262)
This was setting the value to be aligned with the previous default of 4.
However, after some consideration I've realized that 4 is too high a
number, so this PR lowers it to 2.
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
commit 126a40bc711cf06ed44686ee5026624d6b78766e
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Wed Nov 12 07:44:53 2025 -0800
[core] fix idle node termination on object pulling (#57928)
Currently, a node is considered idle while pulling objects from the
remote object store. This can lead to situations where a node is
terminated as idle, causing the cluster to enter an infinite loop when
pulling large objects that exceed the node idle termination timeout.
This PR fixes the issue by treating object pulling as a busy activity.
Note that nodes can still accept additional tasks while pulling objects
(since pulling consumes no resources), but the auto-scaler will no
longer terminate the node prematurely.
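A minimal sketch of the idleness rule described above, assuming a simplified view of node activity. `NodeActivity` and its fields are hypothetical names, not the raylet's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class NodeActivity:
    """Hypothetical snapshot of what a node is currently doing."""
    running_tasks: int = 0
    active_object_pulls: int = 0


def is_idle(activity: NodeActivity) -> bool:
    # Pulling objects now counts as busy, so a long pull cannot trigger
    # idle termination part-way through.
    return activity.running_tasks == 0 and activity.active_object_pulls == 0


pulling_only = NodeActivity(running_tasks=0, active_object_pulls=1)
```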
Closes #54372
Test:
- CI
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit ad8f30291137efce9e463fb23e6821f4c7c74a9c
Author: Sagar Sumit <sagarsumit09@gmail.com>
Date: Wed Nov 12 05:40:47 2025 -0800
[core] Use graceful shutdown path when actor OUT_OF_SCOPE (`del actor`) (#57090)
When actors terminate gracefully, Ray calls the actor's
`__ray_shutdown__()` method if defined, allowing for cleanup of
resources. However, this method is not invoked when the actor goes out
of scope due to `del actor`.
Traced through the entire code path, and here's what happens:
Flow when `del actor` is called:
1. **Python side**: `ActorHandle.__del__()` ->
`worker.core_worker.remove_actor_handle_reference(actor_id)`
https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/python/ray/actor.py#L2040
2. **C++ ref counting**: `CoreWorker::RemoveActorHandleReference()` ->
`reference_counter_->RemoveLocalReference()`
- When ref count reaches 0, triggers `OnObjectOutOfScopeOrFreed`
callback
https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L2503-L2506
3. **Actor manager callback**: `MarkActorKilledOrOutOfScope()` ->
`AsyncReportActorOutOfScope()` to GCS
https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/actor_manager.cc#L180-L183
https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/task_submission/actor_task_submitter.cc#L44-L51
4. **GCS receives notification**: `HandleReportActorOutOfScope()`
- **THE PROBLEM IS HERE** ([line 279 in
`src/ray/gcs/gcs_actor_manager.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/gcs/gcs_actor_manager.cc#L279)):
```cpp
DestroyActor(actor_id,
             GenActorOutOfScopeCause(actor),
             /*force_kill=*/true,  // <-- HARDCODED TO TRUE!
             [reply, send_reply_callback]() {
```
5. **Actor worker receives kill signal**: `HandleKillActor()` in
[`src/ray/core_worker/core_worker.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L3970)
```cpp
if (request.force_kill()) {  // This is TRUE for OUT_OF_SCOPE
  ForceExit(...)  // Skips __ray_shutdown__
} else {
  Exit(...)  // Would call __ray_shutdown__
}
```
6. **ForceExit path**: Bypasses graceful shutdown -> No
`__ray_shutdown__` callback invoked.
This PR simply changes the GCS to use graceful shutdown for OUT_OF_SCOPE
actors. Also, updated the docs.
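The difference between the two exit paths traced above can be sketched in plain Python. This is a simulation only: `handle_kill` and `CleanupActor` are hypothetical, and no Ray APIs are called.

```python
class CleanupActor:
    """Toy actor that records whether its shutdown hook ran."""

    def __init__(self) -> None:
        self.closed = False

    def __ray_shutdown__(self) -> None:
        self.closed = True


def handle_kill(actor, force_kill: bool) -> str:
    """Mimic the branch in step 5: force exit skips the hook, graceful exit runs it."""
    if force_kill:
        return "force_exit"
    shutdown = getattr(actor, "__ray_shutdown__", None)
    if callable(shutdown):
        shutdown()
    return "graceful_exit"


forced = CleanupActor()
forced_path = handle_kill(forced, force_kill=True)

graceful = CleanupActor()
graceful_path = handle_kill(graceful, force_kill=False)
```

With `force_kill=True`, the cleanup hook never runs, which matches the behaviour this PR changes for OUT_OF_SCOPE actors.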
---------
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com>
commit 15393edbe72f5079279d3a0e46b72adc7496cdfc
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Wed Nov 12 19:00:10 2025 +0530
[Core] use client interceptor for adding auth token in c++ client calls (#58424)
- Use client interceptor for adding auth tokens in grpc calls when
`AUTH_MODE=token`
- BuildChannel() will automatically include the interceptor
- Removed `auth_token` parameter from `ClientCallImpl`
- removed manual auth from `python_gcs_subscriber.cc`
- tests to verify auth works for autoscaler apis
---------
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit d496ea87808706333703be6ff25ecc9472330fd5
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Wed Nov 12 11:25:11 2025 +0530
[core] Token auth usability improvements (#58408)
- rename RAY_auth_mode → RAY_AUTH_MODE environment variable across
codebase
- Excluded healthcheck endpoints from authentication for Kubernetes
compatibility
- Fixed dashboard cookie handling to respect auth mode and clear stale
tokens when switching clusters
---------
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 584f5acdf804b1ba097ff7fa5d78a0bfd63c682b
Author: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Date: Tue Nov 11 19:50:52 2025 -0800
[doc][serve][llm] Attached the correct figure to the pd docs (#58543)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
commit a15f5be797ced0df321bfd8d42bab7d57defa2de
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Tue Nov 11 18:00:43 2025 -0800
[doc] downgrade readthedocs to use python 3.10 (#58536)
be consistent with the default build environment
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 9dcb67dc9ff20d9b9ae29875bb610273ba4149ed
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Tue Nov 11 17:26:15 2025 -0800
[core] Fix auth test import (#58554)
The python test step is failing on master now because of this. Probably
a logical merge conflict.
```
FAILED: //python/ray/tests:test_grpc_authentication_server_interceptor (Summary)
...
[2025-11-11T22:11:54Z] from ray.tests.authentication_test_utils import (
--
| [2025-11-11T22:11:54Z] ModuleNotFoundError: No module named 'ray.tests.authentication_test_utils'
```
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 20bf68263beed3609e24aede3d9fc96bc07f0da0
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Tue Nov 11 12:44:05 2025 -0800
[core][rdt] Abort NIXL and allow actor reuse on failed transfers (#56783)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 89a329cd1e0219629132abc203085117a11949f3
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Tue Nov 11 12:26:17 2025 -0800
[core] Improve kill actor logs (#58544)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 6c9607ea57b9edde07c856f094835c84f47b79a6
Author: Nikhil G <nrghosh@users.noreply.github.com>
Date: Tue Nov 11 12:16:41 2025 -0800
[docs][serve][llm] examples and doc for cross-node TP/PP in Serve (#57715)
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>
commit 711d9453828fecebb91b9642e799b4b0b4a493f7
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Tue Nov 11 12:13:13 2025 -0800
[core] Make GlobalState lazy initialization thread-safe (#58182)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit fd10c39829a580bd83ba28c8518e7a7a5ebd3dfb
Author: Kai-Hsun Chen <kaihsun@anyscale.com>
Date: Tue Nov 11 09:43:05 2025 -0800
[core] Scheduling a detached actor with a placement group is not recommended (#57726)
If users schedule a detached actor into a placement group, Raylet will
kill the actor when the placement group is removed. The actor will be
stuck in the `RESTARTING` state forever if it's restartable until users
explicitly kill it.
In that case, if users try to `get_actor` with the actor's name, it can
still return the restarting actor, but no process exists. It will no
longer be restarted because the PG is gone, and no PG with the same ID
will be created during the cluster's lifetime.
The better behavior would be for Ray to transition a task/actor's state
to dead when it is impossible to restart. However, this would add too
much complexity to the core, so I think it's not worth it. Therefore,
this PR adds a warning log, and users should use detached actors or PGs
correctly.
Example: Run the following script and run `ray list actors`.
```python
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
from ray.util.placement_group import placement_group, remove_placement_group
@ray.remote(num_cpus=1, lifetime="detached", max_restarts=-1)
class Actor:
    pass

ray.init()
pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())
actor = Actor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg,
    )
).remote()
ray.get(actor.__ray_ready__.remote())
```
- [ ] Bug fix 🐛
- [ ] New feature ✨
- [x] Enhancement 🚀
- [ ] Code refactoring 🔧
- [ ] Documentation update 📖
- [ ] Chore 🧹
- [ ] Style 🎨
**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [x] No
**Testing:**
- [ ] Added/updated tests for my changes
- [x] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_
**Code Quality:**
- [x] Signed off every commit (`git commit -s`)
- [x] Ran pre-commit hooks ([setup
guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
**Documentation:**
- [ ] Updated documentation (if applicable) ([contribution
guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)
---------
Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
Signed-off-by: Robert Nishihara <robertnishihara@gmail.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 0752886e7d55694b6cf8d780b7470d58266c6a10
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Tue Nov 11 07:19:19 2025 -0800
[core] enable open telemetry by default (#56432)
This PR enables open telemetry as the default backend for ray metric
stack. The bulk of this PR is actually to fix tests that were written
with some assumptions that no longer hold true. For ease of reviewing, I
inline the reasons for the change together with the change for each
tests in the comments.
This PR also depends on a release of vllm (so that we can update the
minimal supported version of vllm in ray).
Test:
- CI
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> Enable OpenTelemetry metrics backend by default and refactor
metrics/Serve tests to use timeseries APIs and updated `ray_serve_*`
metric names.
>
> - **Core/Config**:
> - Default-enable OpenTelemetry: set `RAY_enable_open_telemetry` to
`true` in `ray_constants.py` and `ray_config_def.h`.
> - Metrics `Counter`: use `CythonCount` by default; keep legacy
`CythonSum` only when OTEL is explicitly disabled.
> - **Serve/Metrics Tests**:
> - Replace text scraping with `PrometheusTimeseries` and
`fetch_prometheus_metric_timeseries` throughout.
> - Update metric names/tags to `ray_serve_*` and counter suffixes
`*_total`; adjust latency metric names and processing/queued gauges.
> - Reduce ad-hoc HTTP scrapes; plumb a reusable `timeseries` object and
pass through helpers.
> - **General Test Fixes**:
> - Remove OTEL parametrization/fixtures; simplify expectations where
counters-as-gauges no longer apply; drop related tests.
> - Cardinality tests: include `"low"` level and remove OTEL gating;
stop injecting `enable_open_telemetry` in system config.
> - Actor/state/thread tests: migrate to cluster fixtures, wait for
dashboard agent, and adjust expected worker thread counts.
> - **Build**:
> - Remove OTEL-specific Bazel test shard/env overrides; clean OTEL env
from C++ stats test.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
1d0190f3dd58d5f0c982fcbdab95fcf5f733553f. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit bf595e32d049503f5c1931c5b477647a06d191c2
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Tue Nov 11 19:15:41 2025 +0530
[Core] move authentication_test_utils into ray._private to fix macos tests (#58528)
the auth token test setup in `conftest.py` is breaking macos tests. there
are two test scripts (`test_microbenchmarks.py` and `test_basic.py`)
that run after the wheel is installed but without editable mode. for
these tests to pass, `conftest.py` cannot import anything under
`ray.tests`.
this pr moves `authentication_test_utils` into `ray._private` to fix
this issue
Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
commit 3d29c4ccc9182c44d3cfab08fb561cb7db74eea8
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Tue Nov 11 19:10:56 2025 +0530
[Core] Add Service Interceptor to support token authentication in dashboard agent (#58405)
Add a grpc service interceptor to intercept all dashboard agent rpc
calls and validate the presence of auth token (when auth mode is token)
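A library-free sketch of the kind of check such an interceptor performs on each call. The `authorization` metadata key and the `Bearer ` prefix are assumptions for illustration, and `is_authorized` is a hypothetical helper rather than the interceptor added here.

```python
import hmac
from typing import Mapping, Optional


def is_authorized(
    metadata: Mapping[str, str], expected_token: Optional[str], auth_mode: str
) -> bool:
    """Return True if the call may proceed under the given auth mode."""
    if auth_mode != "token":
        return True  # token auth is off, so nothing to validate
    presented = metadata.get("authorization", "")  # assumed metadata key
    prefix = "Bearer "  # assumed header format
    if expected_token is None or not presented.startswith(prefix):
        return False
    # Constant-time comparison avoids leaking token contents through timing.
    return hmac.compare_digest(
        presented[len(prefix):].encode(), expected_token.encode()
    )


accepted = is_authorized({"authorization": "Bearer s3cret"}, "s3cret", "token")
```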
---------
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 1a48e7318442d038f2c43d22da3b580fa643b8d1
Author: curiosity-hyf <curiooosity.h@gmail.com>
Date: Tue Nov 11 21:35:42 2025 +0800
[Docs] fix pattern_async_actor demo typo (#58486)
fix pattern_async_actor demo typo. Add `self.`.
---------
Signed-off-by: curiosity-hyf <curiooosity.h@gmail.com>
commit f2a7a94a75b007a801ee5a2cf6a6e24b93e9cb9a
Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
Date: Mon Nov 10 18:28:46 2025 -0800
Update pydoclint to version 0.8.1 (#58490)
* Does the work to bump pydoclint up to the latest version
* And allowlist any new violations it finds
---------
Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>
commit 10983e8c9f50ddfa355efe7977d056b29b38d4c1
Author: Goutam <goutam@anyscale.com>
Date: Mon Nov 10 17:34:13 2025 -0800
[Data] - Iceberg support predicate & projection pushdown (#58286)
Predicate pushdown (https://github.com/ray-project/ray/pull/58150) in
conjunction with this PR should speed up reads from Iceberg.
Once the above change lands, we can add the pushdown interface support
for IcebergDatasource
---------
Signed-off-by: Goutam <goutam@anyscale.com>
commit 09f01135f4ab71d52be7a44d06e40ff3767f6cee
Author: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Date: Mon Nov 10 17:28:23 2025 -0800
[serve][llm] Fix import path in multi-node release test (#58498)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
commit 405c4648c2fe71afb7daf4ea574605190f129fd7
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 16:04:48 2025 -0800
[ci] upgrade rayci version (#58514)
to 0.21.0; supports wanda priority now.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 6de012fd0df23993054653ca5517a66944c58dd2
Author: Zac Policzer <zac@anyscale.com>
Date: Mon Nov 10 14:05:15 2025 -0800
[core] Add owned object spill metrics (#57870)
This PR adds 2 new metrics to core_worker by way of the reference
counter. The two new metrics keep track of the count and size of objects
owned by the worker as well as keeping track of their states. States are
defined as:
- **PendingCreation**: An object that is pending creation and hasn't
finished its initialization (and is sizeless)
- **InPlasma**: An object which has an assigned node address and isn't
spilled
- **Spilled**: An object which has an assigned node address and is
spilled
- **InMemory**: An object which has no assigned address but isn't
pending creation (and therefore, must be local)
The approach used by these new metrics is to examine the state 'before
and after' any mutations on the reference in the reference_counter. This
is required in order to do the appropriate bookkeeping (decrementing
values and incrementing others). Admittedly, a metric may be read between
a decrement and its matching increment, depending on when the
RecordMetrics loop runs. This unfortunate side effect, however, seems
preferable to doing mutual exclusion with metric collection, as this is
potentially a high-throughput code path.
In addition, performing live counts seemed preferable to doing a full
accounting of the object store and across all references at the time of
metric collection. The reason is that the reference counter may be
tracking millions of objects, so each metric scan could be very
expensive. Running the accounting live (despite being potentially
inaccurate for short periods) seemed the right call.
This PR also allows for object size to change due to non-deterministic
instantiation (say an object is initially created, but its primary copy
dies, and then the recreation fails). This is an edge case, but seems
important for completeness' sake.
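The 'before and after' bookkeeping described above can be sketched in a few lines. The state rules follow the definitions in this message, but `Ref`, `OwnedObjectMetrics`, and their fields are hypothetical simplifications of the reference counter, not its real structure.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ref:
    """Simplified stand-in for an owned-object reference."""
    pending_creation: bool = True
    node_address: Optional[str] = None
    spilled: bool = False
    size: int = 0


def state_of(ref: Ref) -> str:
    if ref.pending_creation:
        return "PendingCreation"
    if ref.node_address is None:
        return "InMemory"
    return "Spilled" if ref.spilled else "InPlasma"


class OwnedObjectMetrics:
    def __init__(self) -> None:
        self.count = Counter()
        self.bytes = Counter()

    def track(self, ref: Ref) -> None:
        self.count[state_of(ref)] += 1
        self.bytes[state_of(ref)] += ref.size

    def mutate(self, ref: Ref, **changes) -> None:
        # Snapshot the state before the mutation so the old bucket can be decremented.
        before_state, before_size = state_of(ref), ref.size
        for name, value in changes.items():
            setattr(ref, name, value)
        self.count[before_state] -= 1
        self.bytes[before_state] -= before_size
        # Re-derive the state after the mutation and increment the new bucket.
        self.count[state_of(ref)] += 1
        self.bytes[state_of(ref)] += ref.size


metrics = OwnedObjectMetrics()
ref = Ref()
metrics.track(ref)
metrics.mutate(ref, pending_creation=False, node_address="node-1", size=1024)
metrics.mutate(ref, spilled=True)
```

Each mutation costs a constant amount of work, which is the trade-off argued for above versus scanning every reference at collection time.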
---------
Signed-off-by: zac <zac@anyscale.com>
commit f2dd0e2b6dc7bc074f72197ff08f7d4e58635052
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 14:02:11 2025 -0800
[java] remove local genrule `//java:ray_java_pkg` (#58503)
using `bazelisk run //java:gen_ray_java_pkg` everywhere
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit b23adc777c5b103291cf3a35b51b123a808d36f6
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 14:01:27 2025 -0800
[ci] apply isort to release test directory, part 1 (#58505)
excluding `*_tests` directories for now to reduce the impact
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit ce1fd472b2677069a5bfcd2b5ed7a2695f5f2966
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 14:01:06 2025 -0800
[doc] change link check to run on python 3.12 (#58506)
migrating all doc related things to run on python 3.12
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit b09b076e15fefe842a0b7e33accff71ec3c31435
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 14:00:01 2025 -0800
[doc] ci: move doc annotation check to python 3.12 (#58507)
be consistent with doc build environment
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 8971f83ecb40d54729c2c26d394594c29199e19d
Author: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Date: Mon Nov 10 12:52:43 2025 -0800
[data] Clear queue for manually mark_execution_finished operators (#58441)
Currently, we clear _external_ queues when an operator is manually
marked as finished. But we don't clear their _internal_ queues. This PR
fixes that
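A toy sketch of the behaviour being fixed, assuming an operator that holds two queues. `OperatorSketch` and its attribute names are hypothetical and only mirror the internal/external distinction made above.

```python
from collections import deque


class OperatorSketch:
    """Toy operator with an external input queue and an internal work queue."""

    def __init__(self) -> None:
        self.external_input_queue = deque()
        self.internal_queue = deque()
        self.finished = False

    def mark_execution_finished(self) -> None:
        self.finished = True
        # Clearing only the external queue would leave stale internal work behind.
        self.external_input_queue.clear()
        self.internal_queue.clear()


op = OperatorSketch()
op.external_input_queue.extend(["block-1", "block-2"])
op.internal_queue.append("block-0")
op.mark_execution_finished()
```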
Fixes this test
https://buildkite.com/ray-project/postmerge/builds/14223#019a5791-3d46-4ab8-9f97-e03ea1c04bb0/642-736
---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
commit ffb51f866802ad3858d82a9356855a38503efec9
Author: Matthew Owen <mowen@anyscale.com>
Date: Mon Nov 10 10:54:34 2025 -0800
[data] Update depsets for multimodal inference release tests (#57233)
Update remaining multimodal release tests to use new depsets.
commit 62231dd4ba8e784da8800b248ad7616b8db92de7
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 10:30:00 2025 -0800
[ci] separate doc related jobs into their own group (#58454)
so that they are not called lints any more
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 3f7a7b42fda0bb75a9af6e5ad197ba3743b011c2
Author: harshit-anyscale <harshit@anyscale.com>
Date: Mon Nov 10 23:45:38 2025 +0530
increase timeout for test_initial_replica tests (#58423)
- The `test_target_capacity` Windows test is failing, possibly because of
the short timeout of 10 seconds. This increases it to verify
whether the timeout is the issue.
Signed-off-by: harshit <harshit@anyscale.com>
commit 217031a48f4f83d04950ad39b94846ba362edd37
Author: Jugal Shah <47508441+jugalshah291@users.noreply.github.com>
Date: Mon Nov 10 09:39:43 2025 -0800
Define an env for controlling UVloop (#58442)
Our Serve ingress keeps running into the error below, related to `uvloop`,
under heavy load:
```
File descriptor 97 is used by transport
```
The uvloop team has a
[PR](https://github.com/MagicStack/uvloop/pull/646) to fix it, but it seems
like no one is working on it.
One workaround mentioned in the
([PR](https://github.com/MagicStack/uvloop/pull/646#issuecomment-3138886982))
is to just turn off uvloop.
We tried it in our env and didn't see any major performance difference.
Hence, as part of this PR, we are defining a new env var for controlling
uvloop.
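A minimal sketch of gating uvloop behind an environment variable. The variable name `EXAMPLE_USE_UVLOOP` is made up for illustration (the actual name introduced by this PR is not stated here), and the code falls back to the stock asyncio loop when uvloop is disabled or not installed.

```python
import asyncio
import os


def uvloop_enabled(environ=os.environ) -> bool:
    """Return True unless the (hypothetical) EXAMPLE_USE_UVLOOP flag is "0"."""
    return environ.get("EXAMPLE_USE_UVLOOP", "1") != "0"


def new_event_loop(environ=os.environ) -> asyncio.AbstractEventLoop:
    """Create a uvloop loop when enabled and installed, else a stock asyncio loop."""
    if uvloop_enabled(environ):
        try:
            import uvloop  # optional dependency
        except ImportError:
            pass  # fall through to the default loop
        else:
            return uvloop.new_event_loop()
    return asyncio.new_event_loop()


loop = new_event_loop({"EXAMPLE_USE_UVLOOP": "0"})
try:
    result = loop.run_until_complete(asyncio.sleep(0, result="ok"))
finally:
    loop.close()
```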
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
commit 2486ddd9fec83cc940937e3d91368942588ef177
Author: fscnick <6858627+fscnick@users.noreply.github.com>
Date: Mon Nov 10 23:29:03 2025 +0800
[Doc][KubeRay] eliminate vale errors (#58429)
Fix some vale's error and suggestions on the kai-scheduler document.
See https://github.com/ray-project/ray/pull/58161#discussion_r2463701719
Signed-off-by: fscnick <fscnick.dev@gmail.com>
commit cb6a60d0afcfca87734a399291343e297031f1d5
Author: Daniel Sperber <github.blurry@9ox.net>
Date: Mon Nov 10 16:24:34 2025 +0100
[air] Add stacklevel option to deprecation_warning (#58357)
Currently, deprecation warnings are sometimes not informative enough. When
the warning is triggered, it does not tell us *where* the deprecated
feature is used. For example, Ray internally raises a deprecation
warning when an `RLModuleConfig` is initialized.
```python
>>> from ray.rllib.core.rl_module.rl_module import RLModuleConfig
>>> RLModuleConfig()
2025-11-02 18:21:27,318 WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
```
This is confusing, where did *I* use a config, what am I doing wrong?
This raises issues like:
https://discuss.ray.io/t/warning-deprecation-py-50-deprecationwarning-rlmodule-config-rlmoduleconfig-object-has-been-deprecated-use-rlmodule-observation-space-action-space-inference-only-model-config-catalog-class-instead/23064
Tracing where the warning actually originates is tedious: is it my code or
internal? The output just shows `deprecation.py:50`, which is not helpful.
This PR adds a stacklevel option, with stacklevel=2 as the default, to all
`deprecation_warning`s, so devs and users can better see where the
deprecated option is actually used.
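A small sketch of the mechanism, using the standard `logging` module's `stacklevel` argument. The helper below is a simplified stand-in with an assumed signature, not Ray's actual `deprecation_warning`.

```python
import logging

logger = logging.getLogger("example.deprecation")


def deprecation_warning(old, new, stacklevel=2):
    """Log a deprecation message attributed to the caller's file and line."""
    # stacklevel=1 would report this helper; 2 reports whoever called it.
    logger.warning(
        "DeprecationWarning: `%s` has been deprecated. Use `%s` instead.",
        old,
        new,
        stacklevel=stacklevel,
    )


def build_algorithm():
    # The log record now points at this line rather than at the helper.
    deprecation_warning("old_api", "new_api")


if __name__ == "__main__":
    logging.basicConfig(
        format="%(levelname)s %(filename)s:%(lineno)d -- %(message)s"
    )
    build_algorithm()
```

With `stacklevel=2`, the `filename:lineno` prefix in the log line refers to the code that used the deprecated feature.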
---
EDIT:
**Before**
```python
WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])`
```
**After**: the `module.py:line` where the deprecated artifact is used is shown
in the log output.
When building an Algorithm:
```python
WARNING rl_module.py:445 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
```
```python
.../ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
```
Signed-off-by: Daraan <github.blurry@9ox.net>
commit 5bff52ab5d9a9d67de88c4f0b86c918487ed7216
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Mon Nov 10 20:50:21 2025 +0530
[core] Configure an interceptor to pass auth token in python direct g… (#58395)
There are places in the Python code where we use the raw gRPC library to
make gRPC calls (e.g., pub-sub, some calls to GCS). In the long term
we want to fully deprecate gRPC library usage in our Python code base,
but as that can take more effort and testing, this PR
introduces an interceptor to add auth headers (this takes effect
for all gRPC calls made from Python).
```
export RAY_auth_mode="token"
export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
ray start --head
ray job submit -- echo "hi"
```
output
```
ray job submit -- echo "hi"
2025-11-04 06:28:09,122 - INFO - NumExpr defaulting to 4 threads.
Job submission server address: http://127.0.0.1:8265
-------------------------------------------------------
Job 'raysubmit_1EV8q86uKM24nHmH' submitted successfully
-------------------------------------------------------
Next steps
Query the logs of the job:
ray job logs raysubmit_1EV8q86uKM24nHmH
Query the status of the job:
ray job status raysubmit_1EV8q86uKM24nHmH
Request the job to be stopped:
ray job stop raysubmit_1EV8q86uKM24nHmH
Tailing logs until the job exits (disable with --no-wait):
2025-11-04 06:28:10,363 INFO job_manager.py:568 -- Runtime env is setting up.
hi
Running entrypoint for job raysubmit_1EV8q86uKM24nHmH: echo hi
------------------------------------------
Job 'raysubmit_1EV8q86uKM24nHmH' succeeded
------------------------------------------
```
dashboard
test.py
```python
import time
import ray
from ray._raylet import Config

ray.init()


@ray.remote
def print_hi():
    print("Hi")
    time.sleep(2)


@ray.remote
class SimpleActor:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value


actor = SimpleActor.remote()
result = ray.get(actor.increment.remote())
for i in range(100):
    ray.get(print_hi.remote())
time.sleep(20)
ray.shutdown()
```
```
export RAY_auth_mode="token"
export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
python test.py
```
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/008829d8-51b6-445a-b135-5f76b6ccf292"
/>
overview page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/cece0da7-0edd-4438-9d60-776526b49762"
/>
job page: tasks are listed
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/b98eb1d9-cacc-45ea-b0e2-07ce8922202a"
/>
task page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/09ff38e1-e151-4e34-8651-d206eb8b5136"
/>
actors page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/10a30b3d-3f7e-4f3d-b669-962056579459"
/>
specific actor page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/ab1915bd-3d1b-4813-8101-a219432a55c0"
/>
---------
Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
commit 71c7bd056cc132c57a4c3cf13d0f5207cbcfd73f
Author: Xinyu Zhang <60529799+xyuzh@users.noreply.github.com>
Date: Sun Nov 9 08:34:46 2025 -0800
[Data] Add exception handling for invalid URIs in download operation (#58464)
commit d74c1570543045a0f99df4d5690ac44f1fda4a55
Author: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Date: Sat Nov 8 15:35:11 2025 -0800
[dashboards][core] Make `do_reply` accept status_code, instead of success: bool (#58384)
Pass in `status_code` directly into `do_reply`. This is a follow up to
https://github.com/ray-project/ray/pull/58255
---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
commit e793631896f65a88513510b4e7bf6f100607cb03
Author: Rueian <rueiancsie@gmail.com>
Date: Sat Nov 8 15:32:10 2025 -0800
[core][autoscaler] Fix RAY_NODE_TYPE_NAME handling when autoscaler is in read-only mode (#58460)
This ensures node type names are correctly reported even when the
autoscaler is disabled (read-only mode).
Autoscaler v2 fails to report prometheus metrics when operating in
read-only mode on KubeRay with the following KeyError error:
```
2025-11-08 12:06:57,402 ERROR autoscaler.py:215 -- 'small-group'
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state
return Reconciler.reconcile(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 120, in reconcile
Reconciler._step_next(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 275, in _step_next
Reconciler._scale_cluster(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1125, in _scale_cluster
reply = scheduler.schedule(sched_request)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 933, in schedule
ResourceDemandScheduler._enforce_max_workers_per_type(ctx)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 1006, in _enforce_max_workers_per_type
node_config = ctx.get_node_type_configs()[node_type]
KeyError: 'small-group'
```
This happens because the `ReadOnlyProviderConfigReader` populates
`ctx.get_node_type_configs()` using node IDs as node types, which is
correct for local Ray (where local ray does not have
`RAY_NODE_TYPE_NAME` set), but incorrect for KubeRay where
`ray_node_type_name` is present and expected wi…
commit b3a8434d35f7af0322e3b766b1a1809bd29c2837
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Nov 13 14:31:31 2025 -0800
[doc] remove python 3.12 in doc building (#58572)
unifying to python 3.10
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 31f904f630809152ceba67c8bf1684c8c9b685ea
Author: Andrew Sy Kim <andrewsy@google.com>
Date: Thu Nov 13 17:27:23 2025 -0500
Add support for RAY_AUTH_MODE=k8s (#58497)
This PR adds initial support for RAY_AUTH_MODE=k8s. In this mode, Ray
will delegate authentication and authorization of Ray access to
Kubernetes TokenReview and SubjectAccessReview APIs.
---------
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
commit ade535a9519c19c25aa50c562d2c27128b3ca356
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Thu Nov 13 14:08:29 2025 -0800
[serve] fix serve dashboard metric name (#58573)
Prometheus automatically appends the `_total` suffix to all Counter metrics. Ray
has historically supported counter metrics both with and without the `_total`
suffix for backward compatibility, but it is now time to drop that
support (2 years since the warning was added).
There is one place in the Ray Serve dashboard that still doesn't use the
`_total` suffix, so this PR fixes it.
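As an illustration of the naming rule, a hypothetical helper (not part of Ray) that maps a counter's base name to the series name a dashboard query should reference:

```python
def counter_series_name(metric_name: str) -> str:
    """Return the counter name with the `_total` suffix applied exactly once."""
    suffix = "_total"
    return metric_name if metric_name.endswith(suffix) else metric_name + suffix


# "example_requests" is a made-up metric name used only for this sketch.
print(counter_series_name("example_requests"))
print(counter_series_name("example_requests_total"))
```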
Test:
- CI
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit 62a33c29d23a5c1fb91a969b9aea3ffe1f8281cc
Author: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Date: Thu Nov 13 13:33:33 2025 -0800
[Serve.LLM] Add avg prompt length metric (#58599)
Add avg prompt length metric
When using uniform prompt length (especially in testing), the P50 and
P90 computations are skewed due to the 1_2_5 buckets used in vLLM.
Average prompt length provides another useful dimension to look at and
validate.
For example, using uniformly ISL=5000, P50 shows 7200 and P90 shows
9400, and avg accurately shows 5000.
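The skew can be reproduced with a short sketch of bucket-based quantile estimation (linear interpolation inside the bucket, in the style of Prometheus' `histogram_quantile`). The bucket bounds and the sample value 5001 are assumptions for illustration, not the exact vLLM configuration or the data behind the numbers above.

```python
from bisect import bisect_left


def bucket_counts(values, upper_bounds):
    """Count observations per bucket; each bucket is (previous bound, bound]."""
    counts = [0] * len(upper_bounds)
    for v in values:
        # Assumes every value is <= the last upper bound.
        counts[bisect_left(upper_bounds, v)] += 1
    return counts


def histogram_quantile(q, upper_bounds, counts):
    """Estimate quantile q by interpolating linearly inside the target bucket."""
    rank = q * sum(counts)
    cumulative = 0
    lower = 0.0
    for upper, count in zip(upper_bounds, counts):
        if count and cumulative + count >= rank:
            return lower + (upper - lower) * (rank - cumulative) / count
        cumulative += count
        lower = upper
    return upper_bounds[-1]


bounds = [1000, 2000, 5000, 10000]  # assumed 1-2-5 style bucket bounds
values = [5001] * 100               # uniform prompts just above a bound
counts = bucket_counts(values, bounds)
p50 = histogram_quantile(0.5, bounds, counts)
p90 = histogram_quantile(0.9, bounds, counts)
avg = sum(values) / len(values)
print(round(p50), round(p90), round(avg))  # 7500 9500 5001
```

Because every observation lands in the wide (5000, 10000] bucket, the interpolated quantiles drift far above the true value, while the average stays accurate.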
<img width="1186" height="466" alt="image"
src="https://github.com/user-attachments/assets/4615c3ca-2e15-4236-97f9-72bc63ef9d1a"
/>
---------
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit 0c4dcb032ce03a771c3b6276fb661cfc6b839c01
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Thu Nov 13 12:42:49 2025 -0800
[release] allowing for py3.13 images (cpu & cu123) in release tests (#58581)
allowing for py3.13 images (cpu & cu123) in release tests
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
commit c3ba35e6cb1ce4030d8d361a921a697af516fbca
Author: Goutam <goutam@anyscale.com>
Date: Thu Nov 13 12:26:10 2025 -0800
[Data] - [1/n] Add Temporal, list, tensor, struct datatype support to RD Datatype (#58225)
As title suggests
Signed-off-by: Goutam <goutam@anyscale.com>
commit af20446c362a8f4d17b9226d944a3242b0acafaf
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Thu Nov 13 12:18:38 2025 -0800
[core] fix get_metric_check_condition tests (#58598)
Fix `get_metric_check_condition` to use `fetch_prometheus_timeseries`,
which is a non-flaky version of `fetch_prometheus`. Update all test
usages accordingly.
Test:
- CI
---------
Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit f1c613dc386268beec06b6c57c12191218ae7e74
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Thu Nov 13 12:14:04 2025 -0800
[core] add an option to disable otel sdk error logs (#58257)
Currently, Ray metrics and events are exported through a centralized
process called the Dashboard Agent. This process functions as a gRPC
server, receiving data from all other components (GCS, Raylet, workers,
etc.). However, during a node shutdown, the Dashboard Agent may
terminate before the other components, resulting in gRPC errors and
potential loss of metrics and events.
When this happens, the OTel SDK logs become very noisy. Add a default
option to disable OTel SDK logs to avoid confusion.
Test:
- CI
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit 638933ef4aabe24b5def68d72f21e772e354e853
Author: Abrar Sheikh <abrar@anyscale.com>
Date: Thu Nov 13 11:41:29 2025 -0800
[1/n] [Serve] Refactor replica rank to prepare for node local ranks (#58471)
2. **Extracted generic `RankManager` class** - Created reusable rank
management logic separated from deployment-specific concerns
3. **Introduced `ReplicaRank` schema** - Type-safe rank representation
replacing raw integers
4. **Simplified error handling** - not supporting self healing
5. **Updated tests** - Refactored unit tests to use new API and removed
flag-dependent test cases
**Impact:**
- Cleaner separation of concerns in rank management
- Foundation for future multi-level rank support
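To make the idea concrete, here is a rough sketch of what a generic rank manager with a typed rank could look like. The class bodies, method names, and the lowest-free-rank policy are assumptions for illustration; only the names `RankManager` and `ReplicaRank` come from the description above, and this is not the code in the PR.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ReplicaRank:
    """Typed wrapper so ranks are not passed around as raw integers."""
    rank: int


class RankManager:
    """Assigns each key the lowest free rank; released ranks are reused."""

    def __init__(self) -> None:
        self._ranks: Dict[str, ReplicaRank] = {}

    def assign(self, key: str) -> ReplicaRank:
        if key in self._ranks:
            return self._ranks[key]
        used = {r.rank for r in self._ranks.values()}
        # With n ranks in use, some rank in 0..n is guaranteed to be free.
        rank = next(i for i in range(len(used) + 1) if i not in used)
        self._ranks[key] = ReplicaRank(rank)
        return self._ranks[key]

    def release(self, key: str) -> None:
        self._ranks.pop(key, None)


manager = RankManager()
print(manager.assign("replica-a"), manager.assign("replica-b"))
```

Keeping the manager keyed by an opaque string is what would let the same logic serve both deployment-wide and node-local ranks.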
Next PR https://github.com/ray-project/ray/pull/58473
---------
Signed-off-by: abrar <abrar@anyscale.com>
commit 5d5113134bce5929ff7504f733bbee44a7de2987
Author: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
Date: Thu Nov 13 11:21:50 2025 -0800
[Core] Refactor reference_counter out of memory store and plasma store (#57590)
As discovered in the [PR to better define the interface for reference
counter](https://github.com/ray-project/ray/pull/57177#pullrequestreview-3312168933),
plasma store provider and memory store both share thin dependencies on
reference counter that can be refactored out. This will reduce
entanglement in our code base and improve maintainability.
The main logic changes are located in
* src/ray/core_worker/store_provider/plasma_store_provider.cc, where
reference counter related logic is refactored into core worker
* src/ray/core_worker/core_worker.cc, where factored out reference
counter logic is resolved
* src/ray/core_worker/store_provider/memory_store/memory_store.cc, where
logic related to reference counter has either been removed due to the
fact that it is tech debt or refactored into caller functions.
Microbenchmark:
```
single client get calls (Plasma Store) per second 10592.56 +- 535.86
single client put calls (Plasma Store) per second 4908.72 +- 41.55
multi client put calls (Plasma Store) per second 14260.79 +- 265.48
single client put gigabytes per second 11.92 +- 10.21
single client tasks and get batch per second 8.33 +- 0.19
multi client put gigabytes per second 32.09 +- 1.63
single client get object containing 10k refs per second 13.38 +- 0.13
single client wait 1k refs per second 5.04 +- 0.05
single client tasks sync per second 960.45 +- 15.76
single client tasks async per second 7955.16 +- 195.97
multi client tasks async per second 17724.1 +- 856.8
1:1 actor calls sync per second 2251.22 +- 63.93
1:1 actor calls async per second 9342.91 +- 614.74
1:1 actor calls concurrent per second 6427.29 +- 50.3
1:n actor calls async per second 8221.63 +- 167.83
n:n actor calls async per second 22876.04 +- 436.98
n:n actor calls with arg async per second 3531.21 +- 39.38
1:1 async-actor calls sync per second 1581.31 +- 34.01
1:1 async-actor calls async per second 5651.2 +- 222.21
1:1 async-actor calls with args async per second 3618.34 +- 76.02
1:n async-actor calls async per second 7379.2 +- 144.83
n:n async-actor calls async per second 19768.79 +- 211.95
```
This PR mainly makes logic changes to the `ray.get` call chain. As we
can see from the benchmark above, the `single client get calls` performance
matches pre-regression levels.
---------
Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
Co-authored-by: Ibrahim Rabbani <irabbani@anyscale.com>
commit 2352e6b8e1e4488822eb787e6112c18c1964fbe0
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Fri Nov 14 00:49:39 2025 +0530
[Core] Support get-auth-token cli command (#58566)
add support for `ray get-auth-token` cli command + test
---------
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit ea5bc3491a74e2b71f4cb6fdb14787fdcb3314fc
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Fri Nov 14 00:37:23 2025 +0530
[Core] Migrate to HttpOnly cookie-based authentication for enhanced security (#58591)
Migrates Ray dashboard authentication from JavaScript-managed cookies to
server-side HttpOnly cookies to enhance security against XSS attacks.
This addresses code review feedback to improve the authentication
implementation (https://github.com/ray-project/ray/pull/58368)
main changes:
- authentication middleware first looks for `Authorization` header, if
not found it then looks at cookies to look for the auth token
- new `api/authenticate` endpoint for verifying token and setting the
auth token cookie (with `HttpOnly=true`, `SameSite=Strict` and
`secure=true` (when using https))
- removed javascript based cookie manipulation utils and axios
interceptors (were previously responsible for setting cookies)
- cookies are deleted when connecting to a cluster with
`AUTH_MODE=disabled`. connecting to a different ray cluster (with
different auth token) using the same endpoint (eg due to port-forwarding
or local testing) will reshow the popup and ask users to input the right
token.
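The lookup order and cookie attributes listed above can be sketched with the standard library alone. The cookie name and the `Bearer` scheme are assumptions for illustration; the real middleware may differ.

```python
from http.cookies import SimpleCookie

COOKIE_NAME = "example_auth_token"  # hypothetical cookie name


def extract_token(headers):
    """Prefer the Authorization header; fall back to the auth cookie."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):  # assumed scheme
        return auth[len("Bearer "):]
    cookie = SimpleCookie(headers.get("Cookie", ""))
    morsel = cookie.get(COOKIE_NAME)
    return morsel.value if morsel is not None else None


def build_set_cookie(token, https):
    """Build a Set-Cookie value that JavaScript cannot read."""
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = token
    cookie[COOKIE_NAME]["httponly"] = True
    cookie[COOKIE_NAME]["samesite"] = "Strict"
    if https:
        cookie[COOKIE_NAME]["secure"] = True  # only sent over HTTPS
    return cookie[COOKIE_NAME].OutputString()


print(extract_token({"Cookie": "example_auth_token=abc"}))
print(build_set_cookie("abc", https=True))
```

Because the cookie is `HttpOnly`, page scripts cannot read the token, which is the XSS hardening this change is after.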
---------
Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
commit 0905c77db5acd286a6ba84a907c60ad2b15416dd
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Nov 13 10:41:57 2025 -0800
[ci] doc check: remove dependency on `ray_ci` (#58516)
this makes it possible to run on a different python version than the CI
wrapper code.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
commit 0bbd8fd22e0447ec66c12e67afc973e95523451b
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Nov 13 10:35:38 2025 -0800
[ci] mark github.Repository as typechecking (#58582)
so that importing test.py does not always import github
github repo imports jwt, which then imports cryptography and can lead to
issues on windows.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 208970b5b399133a41557db8b16ad6832180e6b7
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Nov 13 10:35:23 2025 -0800
[wheel] stop building python 3.9 wheels on the pipelines (#58587)
also stops building python 3.9 aarch64 images
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 33e855e42baaa1ebf4f3f0a1f96f00e87fdc1d11
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Nov 13 10:32:21 2025 -0800
[serve] run tests in python 3.10 (#58586)
all tests are passing
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 5e8433d3cf8b6bea3366094bb4ecfc6f410dec01
Author: Zac Policzer <zac@anyscale.com>
Date: Thu Nov 13 07:37:52 2025 -0800
[core] Add monitoring in raylet for resouce view (#58382)
Today we have very little observability into pubsub. On a raylet, one of
the most important pieces of state that needs to be propagated through the cluster
via pubsub is cluster membership. All raylets should, in an eventual BUT
timely fashion, agree on the list of available nodes. This metric just
emits a simple counter to keep track of the node count.
More pubsub observability to come.
---------
Signed-off-by: zac <zac@anyscale.com>
Signed-off-by: Zac Policzer <zacattackftw@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit dde70e76e5aa993e9224a2d173a053a35a132ebd
Author: Xinyu Zhang <60529799+xyuzh@users.noreply.github.com>
Date: Wed Nov 12 23:04:37 2025 -0800
[Data] Fix HTTP streaming file download by using `open_input_stream` (#58542)
Fixes HTTP streaming file downloads in Ray Data's download operation.
Some URIs (especially HTTP streams) require `open_input_stream` instead
of `open_input_file`.
- Modified `download_bytes_threaded` in `plan_download_op.py` to try
both `open_input_file` and `open_input_stream` for each URI
- Improved error handling to distinguish between different error types
- Failed downloads now return `None` gracefully instead of crashing
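A simplified sketch of the fallback described in the list above. It accepts any object exposing `open_input_file` and `open_input_stream` (as pyarrow filesystems do); the helper name, the broad `except`, and the stand-in filesystem class are assumptions for illustration rather than the code in `plan_download_op.py`.

```python
import io


def read_uri_bytes(fs, path):
    """Try a random-access read first, then a sequential stream; None on failure."""
    for opener_name in ("open_input_file", "open_input_stream"):
        try:
            with getattr(fs, opener_name)(path) as f:
                return f.read()
        except Exception:
            # Some sources (e.g. HTTP streams) cannot seek; try the next opener.
            continue
    return None


class StreamOnlyFS:
    """Stand-in filesystem whose files can only be read as streams."""

    def open_input_file(self, path):
        raise ValueError("Cannot seek streaming HTTP file")

    def open_input_stream(self, path):
        return io.BytesIO(b"payload")


print(read_uri_bytes(StreamOnlyFS(), "http://example.invalid/stream"))
```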
```
import pyarrow as pa
from ray.data.context import DataContext
from ray.data._internal.planner.plan_download_op import download_bytes_threaded
urls = [
    "https://static-assets.tesla.com/configurator/compositor?context=design_studio_2?&bkba_opt=1&view=STUD_3QTR&size=600&model=my&options=$APBS,$IPB7,$PPSW,$SC04,$MDLY,$WY19P,$MTY46,$STY5S,$CPF0,$DRRH&crop=1150,647,390,180&",
]
table = pa.table({"url": urls})
ctx = DataContext.get_current()
results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
result_table = results[0]
for i in range(result_table.num_rows):
    url = result_table['url'][i].as_py()
    bytes_data = result_table['bytes'][i].as_py()
    if bytes_data is None:
        print(f"Row {i}: FAILED (None) - try-catch worked ✓")
    else:
        print(f"Row {i}: SUCCESS ({len(bytes_data)} bytes)")
    print(f" URL: {url[:60]}...")
print("\n✅ Test passed: Failed downloads return None instead of crashing.")
```
Before the fix:
```
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/default/test_streaming_fallback.py", line 110, in <module>
test_download_expression_with_streaming_fallback()
File "/home/ray/default/test_streaming_fallback.py", line 67, in test_download_expression_with_streaming_fallback
with patch.object(pafs.FileSystem, "open_input_file", mock_open_input_file):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1594, in __enter__
if not self.__exit__(*sys.exc_info()):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1603, in __exit__
setattr(self.target, self.attribute, self.temp_original)
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'
(base) ray@ip-10-0-39-21:~/default$ python test.py
2025-11-11 18:32:23,510 WARNING util.py:1059 -- Caught exception in transforming worker!
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
for result in fn(input_queue_iter):
^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
yield f.read()
^^^^^^^^
File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
Traceback (most recent call last):
File "/home/ray/default/test.py", line 16, in <module>
results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 207, in download_bytes_threaded
uri_bytes = list(
^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1113, in make_async_gen
raise item
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
for result in fn(input_queue_iter):
^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
yield f.read()
^^^^^^^^
File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
```
After the fix:
```
Row 0: SUCCESS (189370 bytes)
URL: https://static-assets.tesla.com/configurator/compositor?cont...
```
Tested with HTTP streaming URLs (e.g., Tesla configurator images) that
previously failed:
- ✅ Successfully downloads HTTP stream files
- ✅ Gracefully handles failed downloads (returns None)
- ✅ Maintains backward compatibility with existing file downloads
---------
Signed-off-by: xyuzh <xinyzng@gmail.com>
Signed-off-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
commit 438d6dcf225b7b03ba75ce9593050971458b94ac
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 22:19:50 2025 -0800
[ci] pin docker client version (#58579)
otherwise, the newer docker client will refuse to communicate with the
docker daemon that is on an older version.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 633bb7b1d57ca58a05e905ee4551ee5f96d71750
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Wed Nov 12 22:08:45 2025 -0800
[deps] adding include_setuptools flag for depset config (#58580)
Adds an optional `include_setuptools` flag for depset configuration.
If the flag is set on a depset config, `--unsafe-package setuptools` will
not be included for depset compilation.
If the flag does not exist (default false) on a depset config,
`--unsafe-package setuptools` will be appended to the default arguments.
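The rule can be sketched as a small argument builder. The function name and the sample default argument are hypothetical; only the flag name and the `--unsafe-package setuptools` argument come from the description above.

```python
def build_compile_args(default_args, include_setuptools=False):
    """Return compile arguments, marking setuptools unsafe unless included."""
    args = list(default_args)
    if not include_setuptools:
        # Default behaviour: keep setuptools out of the compiled depset.
        args += ["--unsafe-package", "setuptools"]
    return args


print(build_compile_args(["--generate-hashes"]))
print(build_compile_args(["--generate-hashes"], include_setuptools=True))
```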
---------
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
commit 292b977661b1ee9804bc0c6a3d3fbecd2b89ec25
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 20:36:43 2025 -0800
[serve] remove minbuild-serve-py3.9 (#58585)
nothing is using it anymore
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 0cdbe3f24132c69c4d6ce9322f85de767b660135
Author: Ibrahim Rabbani <irabbani@anyscale.com>
Date: Wed Nov 12 18:48:27 2025 -0800
[core] (cgroups) Use /proc/mounts if mount file is missing. (#58577)
Signed-off-by: irabbani <irabbani@anyscale.com>
commit 22fbee343bc5326b2912ee24eb8faa8517ea29ec
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 18:26:25 2025 -0800
[deps] update `requirements_buildkite.txt` (#58574)
as the pydantic version is pinned in `requirements-doc.txt` now.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 7a6e29e96b1fa33ad5ff45e37d6f4da7eadd822a
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 16:38:54 2025 -0800
Revert "[bazel] upgrade bazel python rules to 0.25.0" (#58578)
Reverts ray-project/ray#58535
failing on windows.. :(
commit 2f55d078bb69f39198eccf6293683e17a2e72dc5
Author: Goutam <goutam@anyscale.com>
Date: Wed Nov 12 16:37:24 2025 -0800
[Data] - Iceberg support upsert tables + schema update + overwrite tables (#58270)
- Support upserting iceberg tables for IcebergDatasink
- Update schema on APPEND and UPSERT
- Enable overwriting the entire table
Upgrades to pyiceberg 0.10.0 because it now supports upsert and overwrite
functionality. Also, for append, the library now handles the transaction
logic implicitly, so that burden can be lifted from Ray Data.
---------
Signed-off-by: Goutam <goutam@anyscale.com>
commit d6793ecdbc4e6043cc0b0f19862b4b0c8256bb7f
Author: Joshua Lee <73967497+Sparks0219@users.noreply.github.com>
Date: Wed Nov 12 16:31:26 2025 -0800
[core] Use GetNodeAddressAndLiveness in raylet client pool (#58576)
Using GetNodeAddressAndLiveness in raylet client pool instead of the
bulkier Get, same for AsyncGetAll. Seems like it was already done in
core worker client pool, so just making the same change for raylet
client pool.
Signed-off-by: joshlee <joshlee@anyscale.com>
commit e713b3de319afd437f2de7435f5a2870167fa99a
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 15:01:35 2025 -0800
[doc] set default python env to 3.10 (#58570)
we stop supporting building with python 3.9 now
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 8e4b32e0366a9b32f7dfbd55d5dd5a30fc5c734b
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 15:01:20 2025 -0800
[bazel] rename contraint from hermatic to python_version (#58499)
which is more accurate
also moves python constraint definitions into the `bazel/` directory and
registers the python 3.10 platform with the hermetic toolchain
this allows performing the migration from python 3.9 to python 3.10
incrementally
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 0d56f3ef9ae32c5ce8543bb76d9ccde120140623
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Wed Nov 12 14:23:17 2025 -0800
[images][deps] raydepsets base extra depset (#58461)
generating depsets for base extra python requirements
Installing requirements in base extra image
---------
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
commit df65225e4f98bce2b45405b1cf89fb70556e2871
Author: Daniel Shin <88547237+kyuds@users.noreply.github.com>
Date: Thu Nov 13 07:08:15 2025 +0900
[Data] Use Approximate Quantile for RobustScaler Preprocessor (#58371)
Currently Ray Data has a preprocessor called `RobustScaler`. This scales
the data based on given quantiles. Calculating the quantiles involves
sorting the entire dataset by column for each column (C sorts for C
number of columns), which, for a large dataset, will require a lot of
calculations.
** MAJOR EDIT **: had to replace the original `tdigest` with `ddsketch`
as I couldn't actually find well-maintained tdigest libraries for
python. ddsketch is better maintained.
** MAJOR EDIT 2 **: discussed offline to use `ApproximateQuantile`
aggregator
---------
Signed-off-by: kyuds <kyuseung1016@gmail.com>
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
commit 5e71d58badbfdcfc002826398c3e02469065cc71
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Thu Nov 13 03:33:18 2025 +0530
[Core] support token auth in ray client server (#58557)
support token auth in ray client server by using the existing grpc
interceptors. This pr refactors the code to:
- add/rename sync and async client and server interceptors
- create grpc utils to house grpc channel and server creation logic,
python codebase is updated to use these methods
- separate tests for sync and async interceptors
- make existing authentication integration tests to run with RAY_CLIENT
mode
---------
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit a6cc5499e7fa07c0d6cdc7b7cd0b08dfc08073dd
Author: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
Date: Wed Nov 12 13:45:02 2025 -0800
[Core] Move request id creation to worker to address plasma get perf regression (#58390)
This PR addresses the performance regression introduced in the [PR to make
ray.get thread safe](https://github.com/ray-project/ray/pull/57911).
Specifically, the previous PR requires the worker to block and wait for
AsyncGet to return with a reply of the request id needed for correctly
cleaning up get requests. This additional synchronous step causes the
plasma store Get to regress in performance.
This PR moves the request id generation step to the plasma store,
removing the blocking step to fix the perf regression.
- [PR which introduced perf
regression](https://github.com/ray-project/ray/pull/57911)
- [PR which observed the
regression](https://github.com/ray-project/ray/pull/58175)
New performance of the change measured by `ray microbenchmark`.
<img width="485" height="17" alt="image"
src="https://github.com/user-attachments/assets/b96b9676-3735-4e94-9ade-aaeb7514f4d0"
/>
Original performance prior to the change. Here we focus on the
regressing `single client get calls (Plasma Store)` metric, where our
new performance returns us back to the original 10k per second range
compared to the existing sub 5k per second.
<img width="811" height="355" alt="image"
src="https://github.com/user-attachments/assets/d1fecf82-708e-48c4-9879-34c59a5e056c"
/>
---------
Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
commit 9e450e6805824ac825488e1455ac97f93df0bbc3
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 12:36:21 2025 -0800
[doc] symlink the doc dependency lock file (#58520)
and ask people to use that lock file for building docs.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 16c2f5fffbd1d772606de28ac39c0bb7182efdd4
Author: Lehui Liu <lehui@anyscale.com>
Date: Wed Nov 12 12:08:28 2025 -0800
[train] Set JAX_PLATFORMS env var based on ScalingConfig (#57783)
1. JaxTrainer relies on the runtime env var "JAX_PLATFORMS" being set
to initialize jax.distributed:
https://github.com/ray-project/ray/blob/master/python/ray/train/v2/jax/config.py#L38
2. Before this change, users had to both configure `use_tpu=True`
in `ray.train.ScalingConfig` and pass `JAX_PLATFORMS=tpu` to be able
to start jax.distributed. `JAX_PLATFORMS` can be a comma-separated string.
3. If users use other jax.distributed libraries like Orbax, this can
sometimes lead to a misleading error about distributed initialization.
4. After this change, if the user sets `use_tpu=True`, we automatically add
`tpu` to this env var.
5. A TPU unit test is not available at this time; we will explore how to
cover it later.
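The behavior described in item 4 can be sketched roughly as follows. This is a minimal illustration, not the actual Ray Train code: `build_jax_env` is a hypothetical helper, and the way it merges `tpu` into an existing comma-separated `JAX_PLATFORMS` value is an assumption.

```python
from typing import Dict, Optional


def build_jax_env(use_tpu: bool, existing: Optional[str] = None) -> Dict[str, str]:
    """Return the env vars to set for JAX workers (hypothetical helper)."""
    # JAX_PLATFORMS may be a comma-separated string such as "tpu,cpu".
    platforms = [p.strip() for p in (existing or "").split(",") if p.strip()]
    if use_tpu and "tpu" not in platforms:
        # Assumption: "tpu" is placed first so that it takes priority.
        platforms.insert(0, "tpu")
    return {"JAX_PLATFORMS": ",".join(platforms)} if platforms else {}


print(build_jax_env(use_tpu=True))
print(build_jax_env(use_tpu=True, existing="cpu"))
```

With a helper like this, a user who sets `use_tpu=True` would no longer need to pass `JAX_PLATFORMS=tpu` by hand.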
---------
Signed-off-by: Lehui Liu <lehui@anyscale.com>
commit 1ab16e26a0251d3964637c6fe0f2f9a0ae8c6312
Author: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Date: Wed Nov 12 12:04:16 2025 -0800
[Data] Add `Ranker` Interface (#58513)
Creates a ranker interface that will rank the best operator to run next
in `select_operator_to_run`. This code only refactors the existing
code. The ranking value must be something that is comparable.
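A minimal sketch of what such an interface could look like. The class names, the `Op` fields, and the "lowest rank wins" convention are assumptions made for illustration; they are not taken from the Ray Data source.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Op:
    """Stand-in for a physical operator (hypothetical fields)."""
    name: str
    throttled: bool
    queued_bytes: int


class Ranker(ABC):
    @abstractmethod
    def rank_operator(self, op: Op) -> Any:
        """Return any comparable value; lower values are preferred here."""


class QueueSizeRanker(Ranker):
    def rank_operator(self, op: Op) -> Any:
        # Tuples compare element by element, so they are a convenient
        # comparable ranking value: unthrottled first, then smallest queue.
        return (op.throttled, op.queued_bytes)


def select_operator_to_run(ops: List[Op], ranker: Ranker) -> Optional[Op]:
    """Pick the operator with the lowest rank, or None if there are none."""
    if not ops:
        return None
    return min(ops, key=ranker.rank_operator)


ops = [Op("map", False, 300), Op("read", False, 100), Op("write", True, 10)]
print(select_operator_to_run(ops, QueueSizeRanker()).name)
```

Keeping the ranking value opaque (anything comparable) lets different rankers return numbers, tuples, or custom objects without changing the selection loop.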
---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
commit 9d5a2416e2980501ffc5c094ce5c59709f93ccf2
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 11:50:42 2025 -0800
[bazel] upgrade bazel python rules to 0.25.0 (#58535)
previously it was actually using 0.4.0, which is set up by the grpc
repo. the declaration in the workspace file was being shadowed.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 02afe68937429bfd6501e4d0f46780bca4dea329
Author: Balaji Veeramani <balaji@anyscale.com>
Date: Wed Nov 12 11:34:59 2025 -0800
[Data] Refactor concurrency validation tests in `test_map.py` (#58549)
The original `test_concurrency` function combined multiple test
scenarios into a single test with complex control flow and expensive Ray
cluster initialization. This refactoring extracts the parameter
validation tests into focused, independent tests that are faster,
clearer, and easier to maintain.
Additionally, the original test included "validation" cases that tested
valid concurrency parameters but didn't actually verify that concurrency
was being limited correctly—they only checked that the output was
correct, which isn't useful for validating the concurrency feature
itself.
**Key improvements:**
- Split validation tests into `test_invalid_func_concurrency_raises` and
`test_invalid_class_concurrency_raises`
- Use parametrized tests for different invalid concurrency values
- Switch from `shutdown_only` with explicit `ray.init()` to
`ray_start_regular_shared` to eliminate cluster initialization overhead
- Minimize test data from 10 blocks to 1 element since we're only
validating parameter errors
- Remove non-validation tests that didn't verify concurrency behavior
The validation tests now execute significantly faster and provide
clearer failure messages. Each test has a single, well-defined purpose
making maintenance and debugging easier.
---------
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
commit 676b86f4a8d6a4c4eab70f5f381642d9a17fdca2
Author: Balaji Veeramani <balaji@anyscale.com>
Date: Wed Nov 12 11:32:48 2025 -0800
[Data] Convert rST-style to Google-style docstrings in `ray.data` (#58523)
This PR improves documentation consistency in the `python/ray/data`
module by converting all remaining rST-style docstrings (`:param:`,
`:return:`, etc.) to Google-style format (`Args:`, `Returns:`, etc.).
**Files modified:**
- `python/ray/data/preprocessors/utils.py` - Converted
`StatComputationPlan.add_callable_stat()`
- `python/ray/data/preprocessors/encoder.py` - Converted
`unique_post_fn()`
- `python/ray/data/block.py` - Converted `BlockColumnAccessor.hash()`
and `BlockColumnAccessor.is_composed_of_lists()`
- `python/ray/data/_internal/datasource/delta_sharing_datasource.py` -
Converted `DeltaSharingDatasource.setup_delta_sharing_connections()`
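For reference, the two docstring styles differ as sketched below. The functions are hypothetical and are not taken from `ray.data`; they only illustrate the conversion.

```python
from typing import List


def scale_rst(values: List[float], factor: float) -> List[float]:
    """Multiply each value by a constant factor (rST style, before).

    :param values: Numbers to scale.
    :param factor: Multiplier applied to each value.
    :return: A new list with each value multiplied by ``factor``.
    """
    return [v * factor for v in values]


def scale_google(values: List[float], factor: float) -> List[float]:
    """Multiply each value by a constant factor (Google style, after).

    Args:
        values: Numbers to scale.
        factor: Multiplier applied to each value.

    Returns:
        A new list with each value multiplied by ``factor``.
    """
    return [v * factor for v in values]
```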
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
commit 7e872837e450411e9da45acea0c52f4b67221500
Author: Nikhil G <nrghosh@users.noreply.github.com>
Date: Wed Nov 12 09:07:32 2025 -0800
[serve][llm] Fix ReplicaContext serialization error in DPRankAssigner (#58504)
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
commit cd09d104f6d595a805fd8f9979d9f81a828823b5
Author: Alexey Kudinkin <ak@anyscale.com>
Date: Wed Nov 12 11:50:05 2025 -0500
[Data] Lowering `DEFAULT_ACTOR_MAX_TASKS_IN_FLIGHT_TO_MAX_CONCURRENCY_FACTOR` to 2 (#58262)
This value was originally set to align with the previous default of 4.
However, after some consideration, I've realized that 4 is too high,
so this PR lowers it to 2.
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
commit 126a40bc711cf06ed44686ee5026624d6b78766e
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Wed Nov 12 07:44:53 2025 -0800
[core] fix idle node termination on object pulling (#57928)
Currently, a node is considered idle while pulling objects from the
remote object store. This can lead to situations where a node is
terminated as idle, causing the cluster to enter an infinite loop when
pulling large objects that exceed the node idle termination timeout.
This PR fixes the issue by treating object pulling as a busy activity.
Note that nodes can still accept additional tasks while pulling objects
(since pulling consumes no resources), but the auto-scaler will no
longer terminate the node prematurely.
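The idea can be sketched as a small idle check. This is an illustrative model only: `NodeActivity`, its fields, and `should_terminate` are hypothetical names and do not mirror the raylet or autoscaler code.

```python
from dataclasses import dataclass


@dataclass
class NodeActivity:
    """Hypothetical snapshot of what a node is doing."""
    running_tasks: int = 0
    objects_being_pulled: int = 0


def is_idle(activity: NodeActivity) -> bool:
    # Pulling objects now counts as a busy activity, even though it
    # consumes no schedulable resources.
    return activity.running_tasks == 0 and activity.objects_being_pulled == 0


def should_terminate(
    activity: NodeActivity, idle_duration_s: float, idle_timeout_s: float
) -> bool:
    """Terminate only when the node is idle and has been for long enough."""
    return is_idle(activity) and idle_duration_s >= idle_timeout_s


pulling = NodeActivity(objects_being_pulled=1)
print(should_terminate(pulling, idle_duration_s=120, idle_timeout_s=60))
```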
Closes #54372
Test:
- CI
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit ad8f30291137efce9e463fb23e6821f4c7c74a9c
Author: Sagar Sumit <sagarsumit09@gmail.com>
Date: Wed Nov 12 05:40:47 2025 -0800
[core] Use graceful shutdown path when actor OUT_OF_SCOPE (`del actor`) (#57090)
When actors terminate gracefully, Ray calls the actor's
`__ray_shutdown__()` method if defined, allowing for cleanup of
resources. However, this method is not invoked when the actor goes out
of scope due to `del actor`.
Traced through the entire code path, and here's what happens:
Flow when `del actor` is called:
1. **Python side**: `ActorHandle.__del__()` ->
`worker.core_worker.remove_actor_handle_reference(actor_id)`
https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/python/ray/actor.py#L2040
2. **C++ ref counting**: `CoreWorker::RemoveActorHandleReference()` ->
`reference_counter_->RemoveLocalReference()`
- When ref count reaches 0, triggers `OnObjectOutOfScopeOrFreed`
callback
https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L2503-L2506
3. **Actor manager callback**: `MarkActorKilledOrOutOfScope()` ->
`AsyncReportActorOutOfScope()` to GCS
https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/actor_manager.cc#L180-L183
https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/task_submission/actor_task_submitter.cc#L44-L51
4. **GCS receives notification**: `HandleReportActorOutOfScope()`
- **THE PROBLEM IS HERE** ([line 279 in
`src/ray/gcs/gcs_actor_manager.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/gcs/gcs_actor_manager.cc#L279)):
```cpp
DestroyActor(actor_id,
GenActorOutOfScopeCause(actor),
/*force_kill=*/true, // <-- HARDCODED TO TRUE!
[reply, send_reply_callback]() {
```
5. **Actor worker receives kill signal**: `HandleKillActor()` in
[`src/ray/core_worker/core_worker.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L3970)
```cpp
if (request.force_kill()) { // This is TRUE for OUT_OF_SCOPE
ForceExit(...) // Skips __ray_shutdown__
} else {
Exit(...) // Would call __ray_shutdown__
}
```
6. **ForceExit path**: Bypasses graceful shutdown -> No
`__ray_shutdown__` callback invoked.
This PR simply changes the GCS to use graceful shutdown for OUT_OF_SCOPE
actors. It also updates the docs.
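The difference between the two exit paths can be modeled in a few lines of plain Python. This is a toy simulation, not Ray code: `ToyWorker` and `Cache` are hypothetical, and only the `__ray_shutdown__` hook name comes from the description above.

```python
class Cache:
    """Hypothetical actor class that needs cleanup on shutdown."""

    def __init__(self) -> None:
        self.closed = False

    def __ray_shutdown__(self) -> None:
        # Cleanup that should run when the actor exits gracefully.
        self.closed = True


class ToyWorker:
    """Toy stand-in for the worker process hosting an actor."""

    def __init__(self, actor: object) -> None:
        self.actor = actor

    def handle_kill_actor(self, force_kill: bool) -> str:
        if force_kill:
            # Force exit: the shutdown hook is skipped entirely.
            return "force_exit"
        hook = getattr(self.actor, "__ray_shutdown__", None)
        if callable(hook):
            hook()
        return "graceful_exit"


actor = Cache()
worker = ToyWorker(actor)
print(worker.handle_kill_actor(force_kill=True), actor.closed)
print(worker.handle_kill_actor(force_kill=False), actor.closed)
```

In this model, switching the out-of-scope case from `force_kill=True` to `force_kill=False` is what allows the cleanup hook to run.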
---------
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com>
commit 15393edbe72f5079279d3a0e46b72adc7496cdfc
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Wed Nov 12 19:00:10 2025 +0530
[Core] use client interceptor for adding auth token in c++ client calls (#58424)
- Use client interceptor for adding auth tokens in grpc calls when
`AUTH_MODE=token`
- BuildChannel() will automatically include the interceptor
- Removed `auth_token` parameter from `ClientCallImpl`
- Removed manual auth from `python_gcs_subscriber.cc`
- Added tests to verify auth works for autoscaler APIs
---------
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit d496ea87808706333703be6ff25ecc9472330fd5
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Wed Nov 12 11:25:11 2025 +0530
[core] Token auth usability improvements (#58408)
- Renamed the RAY_auth_mode environment variable to RAY_AUTH_MODE across
the codebase
- Excluded healthcheck endpoints from authentication for Kubernetes
compatibility
- Fixed dashboard cookie handling to respect auth mode and clear stale
tokens when switching clusters
---------
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 584f5acdf804b1ba097ff7fa5d78a0bfd63c682b
Author: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Date: Tue Nov 11 19:50:52 2025 -0800
[doc][serve][llm] Attached the correct figure to the pd docs (#58543)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
commit a15f5be797ced0df321bfd8d42bab7d57defa2de
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Tue Nov 11 18:00:43 2025 -0800
[doc] downgrade readthedocs to use python 3.10 (#58536)
be consistent with the default build environment
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 9dcb67dc9ff20d9b9ae29875bb610273ba4149ed
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Tue Nov 11 17:26:15 2025 -0800
[core] Fix auth test import (#58554)
The python test step is failing on master now because of this. Probably
a logical merge conflict.
```
FAILED: //python/ray/tests:test_grpc_authentication_server_interceptor (Summary)
...
[2025-11-11T22:11:54Z] from ray.tests.authentication_test_utils import (
--
| [2025-11-11T22:11:54Z] ModuleNotFoundError: No module named 'ray.tests.authentication_test_utils'
```
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 20bf68263beed3609e24aede3d9fc96bc07f0da0
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Tue Nov 11 12:44:05 2025 -0800
[core][rdt] Abort NIXL and allow actor reuse on failed transfers (#56783)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 89a329cd1e0219629132abc203085117a11949f3
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Tue Nov 11 12:26:17 2025 -0800
[core] Improve kill actor logs (#58544)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 6c9607ea57b9edde07c856f094835c84f47b79a6
Author: Nikhil G <nrghosh@users.noreply.github.com>
Date: Tue Nov 11 12:16:41 2025 -0800
[docs][serve][llm] examples and doc for cross-node TP/PP in Serve (#57715)
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>
commit 711d9453828fecebb91b9642e799b4b0b4a493f7
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Tue Nov 11 12:13:13 2025 -0800
[core] Make GlobalState lazy initialization thread-safe (#58182)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit fd10c39829a580bd83ba28c8518e7a7a5ebd3dfb
Author: Kai-Hsun Chen <kaihsun@anyscale.com>
Date: Tue Nov 11 09:43:05 2025 -0800
[core] Scheduling a detached actor with a placement group is not recommended (#57726)
If users schedule a detached actor into a placement group, Raylet will
kill the actor when the placement group is removed. The actor will be
stuck in the `RESTARTING` state forever if it's restartable until users
explicitly kill it.
In that case, if users try to `get_actor` with the actor's name, it can
still return the restarting actor, but no process exists. It will no
longer be restarted because the PG is gone, and no PG with the same ID
will be created during the cluster's lifetime.
The better behavior would be for Ray to transition a task/actor's state
to dead when it is impossible to restart. However, this would add too
much complexity to the core, so I think it's not worth it. Therefore,
this PR adds a warning log, and users should use detached actors or PGs
correctly.
Example: Run the following script and run `ray list actors`.
```python
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
from ray.util.placement_group import placement_group, remove_placement_group
@ray.remote(num_cpus=1, lifetime="detached", max_restarts=-1)
class Actor:
    pass
ray.init()
pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())
actor = Actor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg,
    )
).remote()
ray.get(actor.__ray_ready__.remote())
```
- [ ] Bug fix 🐛
- [ ] New feature ✨
- [x] Enhancement 🚀
- [ ] Code refactoring 🔧
- [ ] Documentation update 📖
- [ ] Chore 🧹
- [ ] Style 🎨
**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [x] No
**Testing:**
- [ ] Added/updated tests for my changes
- [x] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_
**Code Quality:**
- [x] Signed off every commit (`git commit -s`)
- [x] Ran pre-commit hooks ([setup
guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
**Documentation:**
- [ ] Updated documentation (if applicable) ([contribution
guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)
---------
Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
Signed-off-by: Robert Nishihara <robertnishihara@gmail.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 0752886e7d55694b6cf8d780b7470d58266c6a10
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Tue Nov 11 07:19:19 2025 -0800
[core] enable open telemetry by default (#56432)
This PR enables OpenTelemetry as the default backend for the Ray metric
stack. The bulk of this PR actually fixes tests that were written
with assumptions that no longer hold true. For ease of reviewing, I
inline the reason for each test change in the comments next to the
change itself.
This PR also depends on a release of vllm (so that we can update the
minimal supported version of vllm in ray).
Test:
- CI
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> Enable OpenTelemetry metrics backend by default and refactor
metrics/Serve tests to use timeseries APIs and updated `ray_serve_*`
metric names.
>
> - **Core/Config**:
> - Default-enable OpenTelemetry: set `RAY_enable_open_telemetry` to
`true` in `ray_constants.py` and `ray_config_def.h`.
> - Metrics `Counter`: use `CythonCount` by default; keep legacy
`CythonSum` only when OTEL is explicitly disabled.
> - **Serve/Metrics Tests**:
> - Replace text scraping with `PrometheusTimeseries` and
`fetch_prometheus_metric_timeseries` throughout.
> - Update metric names/tags to `ray_serve_*` and counter suffixes
`*_total`; adjust latency metric names and processing/queued gauges.
> - Reduce ad-hoc HTTP scrapes; plumb a reusable `timeseries` object and
pass through helpers.
> - **General Test Fixes**:
> - Remove OTEL parametrization/fixtures; simplify expectations where
counters-as-gauges no longer apply; drop related tests.
> - Cardinality tests: include `"low"` level and remove OTEL gating;
stop injecting `enable_open_telemetry` in system config.
> - Actor/state/thread tests: migrate to cluster fixtures, wait for
dashboard agent, and adjust expected worker thread counts.
> - **Build**:
> - Remove OTEL-specific Bazel test shard/env overrides; clean OTEL env
from C++ stats test.
<!-- /CURSOR_SUMMARY -->
---------
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit bf595e32d049503f5c1931c5b477647a06d191c2
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Tue Nov 11 19:15:41 2025 +0530
[Core] move authentication_test_utils into ray._private to fix macos tests (#58528)
The auth token test setup in `conftest.py` is breaking macOS tests. There
are two test scripts (`test_microbenchmarks.py` and `test_basic.py`)
that run after the wheel is installed but without editable mode. For
these tests to pass, `conftest.py` cannot import anything under
`ray.tests`.
This PR moves `authentication_test_utils` into `ray._private` to fix
this issue.
Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
commit 3d29c4ccc9182c44d3cfab08fb561cb7db74eea8
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Tue Nov 11 19:10:56 2025 +0530
[Core] Add Service Interceptor to support token authentication in dashboard agent (#58405)
Add a grpc service interceptor to intercept all dashboard agent rpc
calls and validate the presence of auth token (when auth mode is token)
---------
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 1a48e7318442d038f2c43d22da3b580fa643b8d1
Author: curiosity-hyf <curiooosity.h@gmail.com>
Date: Tue Nov 11 21:35:42 2025 +0800
[Docs] fix pattern_async_actor demo typo (#58486)
fix pattern_async_actor demo typo. Add `self.`.
---------
Signed-off-by: curiosity-hyf <curiooosity.h@gmail.com>
commit f2a7a94a75b007a801ee5a2cf6a6e24b93e9cb9a
Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
Date: Mon Nov 10 18:28:46 2025 -0800
Update pydoclint to version 0.8.1 (#58490)
* Does the work to bump pydoclint up to the latest version
* And allowlist any new violations it finds
---------
Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>
commit 10983e8c9f50ddfa355efe7977d056b29b38d4c1
Author: Goutam <goutam@anyscale.com>
Date: Mon Nov 10 17:34:13 2025 -0800
[Data] - Iceberg support predicate & projection pushdown (#58286)
Predicate pushdown (https://github.com/ray-project/ray/pull/58150) in
conjunction with this PR should speed up reads from Iceberg.
Once the above change lands, we can add the pushdown interface support
for IcebergDatasource
---------
Signed-off-by: Goutam <goutam@anyscale.com>
commit 09f01135f4ab71d52be7a44d06e40ff3767f6cee
Author: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Date: Mon Nov 10 17:28:23 2025 -0800
[serve][llm] Fix import path in multi-node release test (#58498)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
commit 405c4648c2fe71afb7daf4ea574605190f129fd7
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 16:04:48 2025 -0800
[ci] upgrade rayci version (#58514)
to 0.21.0; supports wanda priority now.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 6de012fd0df23993054653ca5517a66944c58dd2
Author: Zac Policzer <zac@anyscale.com>
Date: Mon Nov 10 14:05:15 2025 -0800
[core] Add owned object spill metrics (#57870)
This PR adds 2 new metrics to core_worker by way of the reference
counter. The two new metrics keep track of the count and size of objects
owned by the worker as well as keeping track of their states. States are
defined as:
- **PendingCreation**: An object that is pending creation and hasn't
finished its initialization (and is sizeless)
- **InPlasma**: An object which has an assigned node address and isn't
spilled
- **Spilled**: An object which has an assigned node address and is
spilled
- **InMemory**: An object which has no assigned address but isn't
pending creation (and therefore, must be local)
The approach used by these new metrics is to examine the state 'before
and after' any mutations on the reference in the reference_counter. This
is required in order to do the appropriate bookkeeping (decrementing
some values and incrementing others). Admittedly, metrics may be
recorded in between a decrement and its matching increment, depending
on when the RecordMetrics loop is run. This unfortunate side effect,
however, seems preferable to doing mutual exclusion with metric
collection, as this is potentially a high-throughput code path.
In addition, performing live counts seemed preferable to doing full
accounting of the object store and across all references at time of
metric collection. The reason is that the reference counter is
potentially tracking millions of objects, and each metric scan could
be very expensive. So running the accounting (despite being potentially
inaccurate for short periods) seemed the right call.
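The 'before and after' bookkeeping can be sketched as follows. This is a simplified Python model for illustration; the real implementation lives in the C++ reference counter, and `Ref`, `state_of`, and `OwnedObjectMetrics` are hypothetical names. Only the four state names come from the list above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Ref:
    """Hypothetical, simplified view of an owned object reference."""
    pending_creation: bool = True
    node_address: Optional[str] = None
    spilled: bool = False
    size: int = 0


def state_of(ref: Ref) -> str:
    if ref.pending_creation:
        return "PendingCreation"
    if ref.node_address is not None:
        return "Spilled" if ref.spilled else "InPlasma"
    return "InMemory"


class OwnedObjectMetrics:
    def __init__(self) -> None:
        self.count: Counter = Counter()
        self.bytes: Counter = Counter()

    def add(self, ref: Ref) -> None:
        self.count[state_of(ref)] += 1
        self.bytes[state_of(ref)] += ref.size

    def update(self, ref: Ref, mutate: Callable[[Ref], None]) -> None:
        # Capture the state before the mutation, apply it, then move the
        # count and size from the old state bucket to the new one.
        before_state, before_size = state_of(ref), ref.size
        mutate(ref)
        self.count[before_state] -= 1
        self.bytes[before_state] -= before_size
        self.count[state_of(ref)] += 1
        self.bytes[state_of(ref)] += ref.size


def finish_creation(ref: Ref) -> None:
    ref.pending_creation = False
    ref.node_address = "node-1"
    ref.size = 100


metrics = OwnedObjectMetrics()
ref = Ref()
metrics.add(ref)
metrics.update(ref, finish_creation)
print(dict(metrics.count), dict(metrics.bytes))
```

Because each mutation only touches two buckets, the cost per update stays constant regardless of how many references are tracked. A reader sampling between the decrement and the increment would briefly see one object fewer, which is the short-lived inaccuracy mentioned above.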
This PR also allows for object size to potentially change due to
potential non-deterministic instantiation (say an object is initially
created, but its primary copy dies, and then the recreation fails).
This is an edge case, but seems important for completeness' sake.
---------
Signed-off-by: zac <zac@anyscale.com>
commit f2dd0e2b6dc7bc074f72197ff08f7d4e58635052
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 14:02:11 2025 -0800
[java] remove local genrule `//java:ray_java_pkg` (#58503)
using `bazelisk run //java:gen_ray_java_pkg` everywhere
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit b23adc777c5b103291cf3a35b51b123a808d36f6
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 14:01:27 2025 -0800
[ci] apply isort to release test directory, part 1 (#58505)
excluding `*_tests` directories for now to reduce the impact
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit ce1fd472b2677069a5bfcd2b5ed7a2695f5f2966
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 14:01:06 2025 -0800
[doc] change link check to run on python 3.12 (#58506)
migrating all doc related things to run on python 3.12
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit b09b076e15fefe842a0b7e33accff71ec3c31435
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 14:00:01 2025 -0800
[doc] ci: move doc annotation check to python 3.12 (#58507)
be consistent with doc build environment
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 8971f83ecb40d54729c2c26d394594c29199e19d
Author: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Date: Mon Nov 10 12:52:43 2025 -0800
[data] Clear queue for manually mark_execution_finished operators (#58441)
Currently, we clear _external_ queues when an operator is manually
marked as finished. But we don't clear their _internal_ queues. This PR
fixes that.
Fixes this test
https://buildkite.com/ray-project/postmerge/builds/14223#019a5791-3d46-4ab8-9f97-e03ea1c04bb0/642-736
---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
commit ffb51f866802ad3858d82a9356855a38503efec9
Author: Matthew Owen <mowen@anyscale.com>
Date: Mon Nov 10 10:54:34 2025 -0800
[data] Update depsets for multimodal inference release tests (#57233)
Update remaining multimodal release tests to use new depsets.
commit 62231dd4ba8e784da8800b248ad7616b8db92de7
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 10:30:00 2025 -0800
[ci] separate doc related jobs into its own group (#58454)
so that they are not called lints any more
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 3f7a7b42fda0bb75a9af6e5ad197ba3743b011c2
Author: harshit-anyscale <harshit@anyscale.com>
Date: Mon Nov 10 23:45:38 2025 +0530
increase timeout for test_initial_replica tests (#58423)
- The `test_target_capacity` Windows test is failing, possibly because of
the short timeout of 10 seconds. This increases the timeout to verify
whether the timeout is the issue or not.
Signed-off-by: harshit <harshit@anyscale.com>
commit 217031a48f4f83d04950ad39b94846ba362edd37
Author: Jugal Shah <47508441+jugalshah291@users.noreply.github.com>
Date: Mon Nov 10 09:39:43 2025 -0800
Define an env for controlling UVloop (#58442)
Our Serve ingress keeps running into the error below, related to `uvloop`,
under heavy load:
```
File descriptor 97 is used by transport
```
The uvloop team has a
[PR](https://github.com/MagicStack/uvloop/pull/646) to fix it, but it
seems that no one is working on it.
One of the workarounds mentioned in the
[PR](https://github.com/MagicStack/uvloop/pull/646#issuecomment-3138886982)
is to just turn off uvloop.
We tried it in our environment and didn't see any major performance
difference.
Hence, as part of this PR, we are defining a new environment variable for
controlling uvloop.
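A rough sketch of how an environment variable can gate uvloop is shown below. The variable name `EXAMPLE_USE_UVLOOP` and both helper functions are hypothetical; the actual name introduced by the PR is not stated here. The sketch falls back to the standard asyncio loop when uvloop is disabled or not installed.

```python
import asyncio
import os
from typing import Mapping, Optional

ENV_VAR = "EXAMPLE_USE_UVLOOP"  # hypothetical name, for illustration only


def uvloop_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True unless the env var explicitly disables uvloop."""
    env = os.environ if environ is None else environ
    return env.get(ENV_VAR, "1").lower() not in ("0", "false", "no")


def new_event_loop(
    environ: Optional[Mapping[str, str]] = None,
) -> asyncio.AbstractEventLoop:
    """Create a uvloop loop when enabled and available, else an asyncio one."""
    if uvloop_enabled(environ):
        try:
            import uvloop  # optional dependency

            return uvloop.new_event_loop()
        except ImportError:
            pass  # fall through to the standard library loop
    return asyncio.new_event_loop()


loop = new_event_loop({ENV_VAR: "0"})
try:
    loop.run_until_complete(asyncio.sleep(0))
    print(type(loop).__module__)
finally:
    loop.close()
```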
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
commit 2486ddd9fec83cc940937e3d91368942588ef177
Author: fscnick <6858627+fscnick@users.noreply.github.com>
Date: Mon Nov 10 23:29:03 2025 +0800
[Doc][KubeRay] eliminate vale errors (#58429)
Fix some Vale errors and suggestions in the kai-scheduler document.
See https://github.com/ray-project/ray/pull/58161#discussion_r2463701719
Signed-off-by: fscnick <fscnick.dev@gmail.com>
commit cb6a60d0afcfca87734a399291343e297031f1d5
Author: Daniel Sperber <github.blurry@9ox.net>
Date: Mon Nov 10 16:24:34 2025 +0100
[air] Add stacklevel option to deprecation_warning (#58357)
Currently, deprecation warnings are sometimes not informative enough.
When the warning is triggered, it does not tell us *where* the deprecated
feature is used. For example, Ray internally raises a deprecation
warning when an `RLModuleConfig` is initialized.
```python
>>> from ray.rllib.core.rl_module.rl_module import RLModuleConfig
>>> RLModuleConfig()
2025-11-02 18:21:27,318 WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
```
This is confusing: where did *I* use a config, and what am I doing wrong?
This raises issues like:
https://discuss.ray.io/t/warning-deprecation-py-50-deprecationwarning-rlmodule-config-rlmoduleconfig-object-has-been-deprecated-use-rlmodule-observation-space-action-space-inference-only-model-config-catalog-class-instead/23064
Tracing where the error actually happens is tedious: is it my code or
internal? The output just shows `deprecation.py:50`, which is not helpful.
This PR adds a stacklevel option, with stacklevel=2 as the default, to all
`deprecation_warning`s, so devs and users can better see where the
deprecated option is actually used.
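The effect of `stacklevel` can be demonstrated with the standard `logging` module, whose logging methods accept a `stacklevel` argument (Python 3.8+). The sketch below is a simplified stand-in and not Ray's `deprecation_warning` implementation; the logger name, `build_algorithm`, and the collecting handler are hypothetical.

```python
import logging

logger = logging.getLogger("example.deprecation")


def deprecation_warning(old: str, new: str, stacklevel: int = 2) -> None:
    """Log a deprecation message attributed to the caller's location.

    stacklevel=1 would report this helper itself; stacklevel=2 reports
    the function that called it.
    """
    logger.warning(
        "DeprecationWarning: `%s` has been deprecated. Use `%s` instead.",
        old,
        new,
        stacklevel=stacklevel,
    )


class CollectingHandler(logging.Handler):
    """Keep emitted records in a list so they can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


def build_algorithm() -> None:
    # Hypothetical user code that touches a deprecated API.
    deprecation_warning("OldConfig()", "NewConfig()")


collector = CollectingHandler()
logger.addHandler(collector)
build_algorithm()
record = collector.records[0]
print(record.funcName, record.lineno)
```

The record's `funcName` and `lineno` now point at `build_algorithm`, which is what lets the log line show the user's file and line instead of the helper's.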
---
EDIT:
**Before**
```python
WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])`
```
**After**: the `module.py:line` where the deprecated artifact is used is
shown in the log output:
When building an Algorithm:
```python
WARNING rl_module.py:445 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
```
```python
.../ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
```
Signed-off-by: Daraan <github.blurry@9ox.net>
commit 5bff52ab5d9a9d67de88c4f0b86c918487ed7216
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Mon Nov 10 20:50:21 2025 +0530
[core] Configure an interceptor to pass auth token in python direct g… (#58395)
There are places in the Python code where we use the raw grpc library to
make gRPC calls (e.g. pub-sub, some calls to GCS). In the long term
we want to fully deprecate grpc library usage in our Python codebase,
but as that can take more effort and testing, in this PR I am
introducing an interceptor to add auth headers (this takes effect
for all gRPC calls made from Python).
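The core of such an interceptor is a step that merges an auth header into the outgoing call metadata. The sketch below shows only that step in plain Python, without the grpc library. The `authorization` key and the `Bearer` scheme are assumptions made for illustration, and `with_auth_metadata` is a hypothetical helper, not the function added by this PR.

```python
from typing import List, Optional, Sequence, Tuple

Metadata = Sequence[Tuple[str, str]]


def with_auth_metadata(
    metadata: Optional[Metadata], token: Optional[str]
) -> List[Tuple[str, str]]:
    """Return call metadata with an auth header added when a token is set."""
    # Drop any stale auth entry so the header is never duplicated.
    merged = [(k, v) for k, v in (metadata or []) if k.lower() != "authorization"]
    if token:
        # Assumption: the token is sent as a Bearer authorization header.
        merged.append(("authorization", f"Bearer {token}"))
    return merged


print(with_auth_metadata([("x-request-id", "42")], "abcdef"))
```

An interceptor would call a helper like this once per outgoing RPC, so individual call sites do not need to know about authentication.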
```
export RAY_auth_mode="token"
export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
ray start --head
ray job submit -- echo "hi"
```
output
```
ray job submit -- echo "hi"
2025-11-04 06:28:09,122 - INFO - NumExpr defaulting to 4 threads.
Job submission server address: http://127.0.0.1:8265
-------------------------------------------------------
Job 'raysubmit_1EV8q86uKM24nHmH' submitted successfully
-------------------------------------------------------
Next steps
Query the logs of the job:
ray job logs raysubmit_1EV8q86uKM24nHmH
Query the status of the job:
ray job status raysubmit_1EV8q86uKM24nHmH
Request the job to be stopped:
ray job stop raysubmit_1EV8q86uKM24nHmH
Tailing logs until the job exits (disable with --no-wait):
2025-11-04 06:28:10,363 INFO job_manager.py:568 -- Runtime env is setting up.
hi
Running entrypoint for job raysubmit_1EV8q86uKM24nHmH: echo hi
------------------------------------------
Job 'raysubmit_1EV8q86uKM24nHmH' succeeded
------------------------------------------
```
dashboard
test.py
```python
import time
import ray
from ray._raylet import Config
ray.init()
@ray.remote
def print_hi():
    print("Hi")
    time.sleep(2)
@ray.remote
class SimpleActor:
    def __init__(self):
        self.value = 0
    def increment(self):
        self.value += 1
        return self.value
actor = SimpleActor.remote()
result = ray.get(actor.increment.remote())
for i in range(100):
    ray.get(print_hi.remote())
time.sleep(20)
ray.shutdown()
```
```
export RAY_auth_mode="token"
export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
python test.py
```
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/008829d8-51b6-445a-b135-5f76b6ccf292"
/>
overview page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/cece0da7-0edd-4438-9d60-776526b49762"
/>
job page: tasks are listed
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/b98eb1d9-cacc-45ea-b0e2-07ce8922202a"
/>
task page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/09ff38e1-e151-4e34-8651-d206eb8b5136"
/>
actors page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/10a30b3d-3f7e-4f3d-b669-962056579459"
/>
specific actor page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/ab1915bd-3d1b-4813-8101-a219432a55c0"
/>
---------
Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
commit 71c7bd056cc132c57a4c3cf13d0f5207cbcfd73f
Author: Xinyu Zhang <60529799+xyuzh@users.noreply.github.com>
Date: Sun Nov 9 08:34:46 2025 -0800
[Data] Add exception handling for invalid URIs in download operation (#58464)
commit d74c1570543045a0f99df4d5690ac44f1fda4a55
Author: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Date: Sat Nov 8 15:35:11 2025 -0800
[dashboards][core] Make `do_reply` accept status_code, instead of success: bool (#58384)
Pass in `status_code` directly into `do_reply`. This is a follow up to
https://github.com/ray-project/ray/pull/58255
---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
commit e793631896f65a88513510b4e7bf6f100607cb03
Author: Rueian <rueiancsie@gmail.com>
Date: Sat Nov 8 15:32:10 2025 -0800
[core][autoscaler] Fix RAY_NODE_TYPE_NAME handling when autoscaler is in read-only mode (#58460)
This ensures node type names are correctly reported even when the
autoscaler is disabled (read-only mode).
Autoscaler v2 fails to report prometheus metrics when operating in
read-only mode on KubeRay with the following KeyError error:
```
2025-11-08 12:06:57,402 ERROR autoscaler.py:215 -- 'small-group'
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state
return Reconciler.reconcile(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 120, in reconcile
Reconciler._step_next(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 275, in _step_next
Reconciler._scale_cluster(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1125, in _scale_cluster
reply = scheduler.schedule(sched_request)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 933, in schedule
ResourceDemandScheduler._enforce_max_workers_per_type(ctx)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 1006, in _enforce_max_workers_per_type
node_config = ctx.get_node_type_configs()[node_type]
KeyError: 'small-group'
```
This happens because the `ReadOnlyProviderConfigReader` populates
`ctx.get_node_type_configs()` using node IDs as node types, which is
correct for local Ray (where local ray does not have
`RAY_NODE_TYPE_NAME` set), but incorrect for KubeRay where
`ray_node_type_name` is present and expected wi…
commit b3a8434d35f7af0322e3b766b1a1809bd29c2837
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Nov 13 14:31:31 2025 -0800
[doc] remove python 3.12 in doc building (#58572)
unifying to python 3.10
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 31f904f630809152ceba67c8bf1684c8c9b685ea
Author: Andrew Sy Kim <andrewsy@google.com>
Date: Thu Nov 13 17:27:23 2025 -0500
Add support for RAY_AUTH_MODE=k8s (#58497)
This PR adds initial support for RAY_AUTH_MODE=k8s. In this mode, Ray
will delegate authentication and authorization of Ray access to
Kubernetes TokenReview and SubjectAccessReview APIs.
---------
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
commit ade535a9519c19c25aa50c562d2c27128b3ca356
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Thu Nov 13 14:08:29 2025 -0800
[serve] fix serve dashboard metric name (#58573)
Prometheus automatically appends the `_total` suffix to all Counter
metrics. Ray has historically supported counter metrics both with and
without the `_total` suffix for backward compatibility, but it is now
time to drop that support (2 years since the warning was added).
There is one place in the Ray Serve dashboard that still doesn't use
the `_total` suffix, so fix it in this PR.
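A minimal sketch of the naming rule described above, assuming a hypothetical helper `counter_series_name` and a made-up metric name; it is not code from the Ray dashboard:

```python
def counter_series_name(metric_name: str) -> str:
    """Return the series name a dashboard query should use for a Counter.

    Hypothetical helper: appends `_total` unless the name already ends
    with it.
    """
    suffix = "_total"
    if metric_name.endswith(suffix):
        return metric_name
    return metric_name + suffix


# "example_requests" is an illustrative metric name, not a real Ray metric.
print(counter_series_name("example_requests"))
print(counter_series_name("example_requests_total"))
```

Dashboard queries that reference the un-suffixed name stop matching once the backward-compatible alias is dropped, which is why the remaining query needs the suffix.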
Test:
- CI
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit 62a33c29d23a5c1fb91a969b9aea3ffe1f8281cc
Author: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Date: Thu Nov 13 13:33:33 2025 -0800
[Serve.LLM] Add avg prompt length metric (#58599)
Add avg prompt length metric
When using uniform prompt length (especially in testing), the P50 and
P90 computations are skewed due to the 1_2_5 buckets used in vLLM.
Average prompt length provides another useful dimension to look at and
validate.
For example, using uniformly ISL=5000, P50 shows 7200 and P90 shows
9400, and avg accurately shows 5000.
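To see why bucketed percentiles drift while the average does not, here is a small sketch that estimates quantiles from bucket counts by linear interpolation inside the bucket. The bucket bounds, the sample values (prompts of 5001 tokens, just above a bucket edge), and the helper names are assumptions for illustration; the sketch does not reproduce the exact 7200 and 9400 figures quoted above.

```python
from bisect import bisect_left
from statistics import mean
from typing import List, Sequence


def bucket_counts(values: Sequence[float], upper_bounds: Sequence[float]) -> List[int]:
    """Count values per bucket; a value v lands in the first bucket with v <= bound.

    Assumes every value is <= the last upper bound.
    """
    counts = [0] * len(upper_bounds)
    for v in values:
        counts[bisect_left(upper_bounds, v)] += 1
    return counts


def bucket_quantile(q: float, upper_bounds: Sequence[float], counts: Sequence[int]) -> float:
    """Estimate quantile q by interpolating linearly within the target bucket."""
    rank = q * sum(counts)
    cumulative = 0
    for i, count in enumerate(counts):
        if count > 0 and cumulative + count >= rank:
            lower = upper_bounds[i - 1] if i > 0 else 0.0
            upper = upper_bounds[i]
            return lower + (upper - lower) * (rank - cumulative) / count
        cumulative += count
    return float(upper_bounds[-1])


bounds = [1000, 2000, 5000, 10000]  # a 1-2-5 style bucket layout (assumed)
prompts = [5001] * 100              # uniform prompt length just above 5000
counts = bucket_counts(prompts, bounds)
print(bucket_quantile(0.5, bounds, counts))  # 7500.0
print(bucket_quantile(0.9, bounds, counts))  # 9500.0
print(mean(prompts))                         # 5001
```

Because every sample falls into the same wide bucket, the interpolated percentiles spread across the bucket's range, while the mean is computed from the raw sum and count and stays accurate.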
<img width="1186" height="466" alt="image"
src="https://github.com/user-attachments/assets/4615c3ca-2e15-4236-97f9-72bc63ef9d1a"
/>
---------
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit 0c4dcb032ce03a771c3b6276fb661cfc6b839c01
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Thu Nov 13 12:42:49 2025 -0800
[release] allowing for py3.13 images (cpu & cu123) in release tests (#58581)
allowing for py3.13 images (cpu & cu123) in release tests
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
commit c3ba35e6cb1ce4030d8d361a921a697af516fbca
Author: Goutam <goutam@anyscale.com>
Date: Thu Nov 13 12:26:10 2025 -0800
[Data] - [1/n] Add Temporal, list, tensor, struct datatype support to RD Datatype (#58225)
As title suggests
Signed-off-by: Goutam <goutam@anyscale.com>
commit af20446c362a8f4d17b9226d944a3242b0acafaf
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Thu Nov 13 12:18:38 2025 -0800
[core] fix get_metric_check_condition tests (#58598)
Fix `get_metric_check_condition` to use `fetch_prometheus_timeseries`,
which is a non-flaky version of `fetch_prometheus`. Update all test
usages accordingly.
Test:
- CI
---------
Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit f1c613dc386268beec06b6c57c12191218ae7e74
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Thu Nov 13 12:14:04 2025 -0800
[core] add an option to disable otel sdk error logs (#58257)
Currently, Ray metrics and events are exported through a centralized
process called the Dashboard Agent. This process functions as a gRPC
server, receiving data from all other components (GCS, Raylet, workers,
etc.). However, during a node shutdown, the Dashboard Agent may
terminate before the other components, resulting in gRPC errors and
potential loss of metrics and events.
When this issue occurs, the otel sdk logs become very noisy. Add a
default option to disable otel sdk logs to avoid confusion.
Test:
- CI
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit 638933ef4aabe24b5def68d72f21e772e354e853
Author: Abrar Sheikh <abrar@anyscale.com>
Date: Thu Nov 13 11:41:29 2025 -0800
[1/n] [Serve] Refactor replica rank to prepare for node local ranks (#58471)
2. **Extracted generic `RankManager` class** - Created reusable rank
management logic separated from deployment-specific concerns
3. **Introduced `ReplicaRank` schema** - Type-safe rank representation
replacing raw integers
4. **Simplified error handling** - not supporting self healing
5. **Updated tests** - Refactored unit tests to use new API and removed
flag-dependent test cases
**Impact:**
- Cleaner separation of concerns in rank management
- Foundation for future multi-level rank support
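The shape of a type-safe rank plus a generic manager can be sketched roughly as follows. This is an illustration only: the single `rank` field, the method names, and the lowest-free-rank policy are assumptions, not the actual Serve implementation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ReplicaRank:
    """Typed wrapper around a rank value (hypothetical, single field)."""

    rank: int

    def __post_init__(self) -> None:
        if self.rank < 0:
            raise ValueError("rank must be non-negative")


class RankManager:
    """Generic rank bookkeeping keyed by an opaque string id (assumed API)."""

    def __init__(self) -> None:
        self._ranks: Dict[str, ReplicaRank] = {}

    def assign(self, key: str) -> ReplicaRank:
        # No self-healing: an inconsistent request is reported as an error.
        if key in self._ranks:
            raise ValueError(f"{key} already has a rank")
        used = {r.rank for r in self._ranks.values()}
        rank = 0
        while rank in used:  # pick the lowest free rank
            rank += 1
        self._ranks[key] = ReplicaRank(rank)
        return self._ranks[key]

    def release(self, key: str) -> None:
        if key not in self._ranks:
            raise ValueError(f"{key} has no rank")
        del self._ranks[key]
```

Keeping the manager free of deployment-specific types is what would let the same logic be reused for other rank scopes later.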
Next PR https://github.com/ray-project/ray/pull/58473
---------
Signed-off-by: abrar <abrar@anyscale.com>
commit 5d5113134bce5929ff7504f733bbee44a7de2987
Author: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
Date: Thu Nov 13 11:21:50 2025 -0800
[Core] Refactor reference_counter out of memory store and plasma store (#57590)
As discovered in the [PR to better define the interface for reference
counter](https://github.com/ray-project/ray/pull/57177#pullrequestreview-3312168933),
plasma store provider and memory store both share thin dependencies on
reference counter that can be refactored out. This will reduce
entanglement in our code base and improve maintainability.
The main logic changes are located in
* src/ray/core_worker/store_provider/plasma_store_provider.cc, where
reference counter related logic is refactored into core worker
* src/ray/core_worker/core_worker.cc, where factored out reference
counter logic is resolved
* src/ray/core_worker/store_provider/memory_store/memory_store.cc, where
logic related to reference counter has either been removed due to the
fact that it is tech debt or refactored into caller functions.
Microbenchmark:
```
single client get calls (Plasma Store) per second 10592.56 +- 535.86
single client put calls (Plasma Store) per second 4908.72 +- 41.55
multi client put calls (Plasma Store) per second 14260.79 +- 265.48
single client put gigabytes per second 11.92 +- 10.21
single client tasks and get batch per second 8.33 +- 0.19
multi client put gigabytes per second 32.09 +- 1.63
single client get object containing 10k refs per second 13.38 +- 0.13
single client wait 1k refs per second 5.04 +- 0.05
single client tasks sync per second 960.45 +- 15.76
single client tasks async per second 7955.16 +- 195.97
multi client tasks async per second 17724.1 +- 856.8
1:1 actor calls sync per second 2251.22 +- 63.93
1:1 actor calls async per second 9342.91 +- 614.74
1:1 actor calls concurrent per second 6427.29 +- 50.3
1:n actor calls async per second 8221.63 +- 167.83
n:n actor calls async per second 22876.04 +- 436.98
n:n actor calls with arg async per second 3531.21 +- 39.38
1:1 async-actor calls sync per second 1581.31 +- 34.01
1:1 async-actor calls async per second 5651.2 +- 222.21
1:1 async-actor calls with args async per second 3618.34 +- 76.02
1:n async-actor calls async per second 7379.2 +- 144.83
n:n async-actor calls async per second 19768.79 +- 211.95
```
This PR mainly makes logic changes to the `ray.get` call chain. As we
can see from the benchmark above, the single client get calls
performance matches pre-regression levels.
---------
Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
Co-authored-by: Ibrahim Rabbani <irabbani@anyscale.com>
commit 2352e6b8e1e4488822eb787e6112c18c1964fbe0
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Fri Nov 14 00:49:39 2025 +0530
[Core] Support get-auth-token cli command (#58566)
add support for `ray get-auth-token` cli command + test
---------
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit ea5bc3491a74e2b71f4cb6fdb14787fdcb3314fc
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Fri Nov 14 00:37:23 2025 +0530
[Core] Migrate to HttpOnly cookie-based authentication for enhanced security (#58591)
Migrates Ray dashboard authentication from JavaScript-managed cookies to
server-side HttpOnly cookies to enhance security against XSS attacks.
This addresses code review feedback to improve the authentication
implementation (https://github.com/ray-project/ray/pull/58368)
main changes:
- authentication middleware first looks for `Authorization` header, if
not found it then looks at cookies to look for the auth token
- new `api/authenticate` endpoint for verifying token and setting the
auth token cookie (with `HttpOnly=true`, `SameSite=Strict` and
`secure=true` (when using https))
- removed javascript based cookie manipulation utils and axios
interceptors (were previously responsible for setting cookies)
- cookies are deleted when connecting to a cluster with
`AUTH_MODE=disabled`. connecting to a different ray cluster (with
different auth token) using the same endpoint (eg due to port-forwarding
or local testing) will reshow the popup and ask users to input the right
token.
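The lookup order and cookie attributes listed above can be sketched with the standard library alone. The cookie name, the function names, and the `Bearer` prefix handling are assumptions for illustration, not the dashboard's actual code.

```python
from http.cookies import SimpleCookie
from typing import Mapping, Optional

COOKIE_NAME = "example-auth-token"  # hypothetical cookie name


def extract_token(headers: Mapping[str, str], cookies: Mapping[str, str]) -> Optional[str]:
    """Prefer the Authorization header; fall back to the auth cookie."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        return auth[len("Bearer "):]
    return cookies.get(COOKIE_NAME)


def build_auth_cookie(token: str, is_https: bool) -> str:
    """Build a Set-Cookie value with HttpOnly and SameSite=Strict set."""
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = token
    morsel = cookie[COOKIE_NAME]
    morsel["httponly"] = True
    morsel["samesite"] = "Strict"
    morsel["path"] = "/"
    if is_https:
        morsel["secure"] = True  # only mark Secure when served over https
    return morsel.OutputString()
```

Because the cookie is `HttpOnly`, page scripts cannot read it, so an XSS payload cannot exfiltrate the token the way it could with a JavaScript-managed cookie.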
---------
Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
commit 0905c77db5acd286a6ba84a907c60ad2b15416dd
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Nov 13 10:41:57 2025 -0800
[ci] doc check: remove dependency on `ray_ci` (#58516)
this makes it possible to run on a different python version than the CI
wrapper code.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
commit 0bbd8fd22e0447ec66c12e67afc973e95523451b
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Nov 13 10:35:38 2025 -0800
[ci] mark github.Repository as typechecking (#58582)
so that importing test.py does not always import github
github repo imports jwt, which then imports cryptography and can lead to
issues on windows.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 208970b5b399133a41557db8b16ad6832180e6b7
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Nov 13 10:35:23 2025 -0800
[wheel] stop building python 3.9 wheels on the pipelines (#58587)
also stops building python 3.9 aarch64 images
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 33e855e42baaa1ebf4f3f0a1f96f00e87fdc1d11
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Nov 13 10:32:21 2025 -0800
[serve] run tests in python 3.10 (#58586)
all tests are passing
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 5e8433d3cf8b6bea3366094bb4ecfc6f410dec01
Author: Zac Policzer <zac@anyscale.com>
Date: Thu Nov 13 07:37:52 2025 -0800
[core] Add monitoring in raylet for resouce view (#58382)
We today have very little observability into pubsub. On a raylet one of
the most important states that need to be propagated through the cluster
via pubsub is cluster membership. All raylets should in an eventual BUT
timely fashion agree on the list of available nodes. This metric just
emits a simple counter to keep track of the node count.
More pubsub observability to come.
---------
Signed-off-by: zac <zac@anyscale.com>
Signed-off-by: Zac Policzer <zacattackftw@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit dde70e76e5aa993e9224a2d173a053a35a132ebd
Author: Xinyu Zhang <60529799+xyuzh@users.noreply.github.com>
Date: Wed Nov 12 23:04:37 2025 -0800
[Data] Fix HTTP streaming file download by using `open_input_stream` (#58542)
Fixes HTTP streaming file downloads in Ray Data's download operation.
Some URIs (especially HTTP streams) require `open_input_stream` instead
of `open_input_file`.
- Modified `download_bytes_threaded` in `plan_download_op.py` to try
both `open_input_file` and `open_input_stream` for each URI
- Improved error handling to distinguish between different error types
- Failed downloads now return `None` gracefully instead of crashing
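The fallback described in the list above can be sketched as follows. This is a simplified illustration, assuming a filesystem object that exposes `open_input_file` and `open_input_stream` (as pyarrow filesystems do); the function name `read_uri_bytes` is hypothetical and the real `download_bytes_threaded` may differ.

```python
from typing import Any, Optional


def read_uri_bytes(fs: Any, path: str) -> Optional[bytes]:
    """Try random-access reads first, then streaming; return None on failure."""
    for opener_name in ("open_input_file", "open_input_stream"):
        try:
            # The read is inside the try block because some sources only
            # fail at read time (for example, when a seek is attempted).
            with getattr(fs, opener_name)(path) as f:
                return f.read()
        except Exception:
            continue  # fall through to the next opener
    return None
```

Returning `None` keeps one bad URI from failing the whole batch; callers can filter or report the failed rows afterwards.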
```
import pyarrow as pa
from ray.data.context import DataContext
from ray.data._internal.planner.plan_download_op import download_bytes_threaded
urls = [
"https://static-assets.tesla.com/configurator/compositor?context=design_studio_2?&bkba_opt=1&view=STUD_3QTR&size=600&model=my&options=$APBS,$IPB7,$PPSW,$SC04,$MDLY,$WY19P,$MTY46,$STY5S,$CPF0,$DRRH&crop=1150,647,390,180&",
]
table = pa.table({"url": urls})
ctx = DataContext.get_current()
results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
result_table = results[0]
for i in range(result_table.num_rows):
    url = result_table['url'][i].as_py()
    bytes_data = result_table['bytes'][i].as_py()
    if bytes_data is None:
        print(f"Row {i}: FAILED (None) - try-catch worked ✓")
    else:
        print(f"Row {i}: SUCCESS ({len(bytes_data)} bytes)")
    print(f" URL: {url[:60]}...")
print("\n✅ Test passed: Failed downloads return None instead of crashing.")
```
Before the fix:
```
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/default/test_streaming_fallback.py", line 110, in <module>
test_download_expression_with_streaming_fallback()
File "/home/ray/default/test_streaming_fallback.py", line 67, in test_download_expression_with_streaming_fallback
with patch.object(pafs.FileSystem, "open_input_file", mock_open_input_file):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1594, in __enter__
if not self.__exit__(*sys.exc_info()):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1603, in __exit__
setattr(self.target, self.attribute, self.temp_original)
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'
(base) ray@ip-10-0-39-21:~/default$ python test.py
2025-11-11 18:32:23,510 WARNING util.py:1059 -- Caught exception in transforming worker!
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
for result in fn(input_queue_iter):
^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
yield f.read()
^^^^^^^^
File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
Traceback (most recent call last):
File "/home/ray/default/test.py", line 16, in <module>
results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 207, in download_bytes_threaded
uri_bytes = list(
^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1113, in make_async_gen
raise item
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
for result in fn(input_queue_iter):
^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
yield f.read()
^^^^^^^^
File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
```
After the fix:
```
Row 0: SUCCESS (189370 bytes)
URL: https://static-assets.tesla.com/configurator/compositor?cont...
```
Tested with HTTP streaming URLs (e.g., Tesla configurator images) that
previously failed:
- ✅ Successfully downloads HTTP stream files
- ✅ Gracefully handles failed downloads (returns None)
- ✅ Maintains backward compatibility with existing file downloads
---------
Signed-off-by: xyuzh <xinyzng@gmail.com>
Signed-off-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
commit 438d6dcf225b7b03ba75ce9593050971458b94ac
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 22:19:50 2025 -0800
[ci] pin docker client version (#58579)
otherwise, the newer docker client will refuse to communicate with the
docker daemon that is on an older version.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 633bb7b1d57ca58a05e905ee4551ee5f96d71750
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Wed Nov 12 22:08:45 2025 -0800
[deps] adding include_setuptools flag for depset config (#58580)
Adding optional `include_setuptools` flag for depset configuration
If the flag is set on a depset config, `--unsafe-package setuptools`
will not be included for depset compilation.
If the flag does not exist (default false) on a depset config,
`--unsafe-package setuptools` will be appended to the default arguments.
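A minimal sketch of that flag logic, assuming a hypothetical helper `depset_compile_args`; the real depset tooling builds a longer argument list that is omitted here:

```python
from typing import List


def depset_compile_args(include_setuptools: bool = False) -> List[str]:
    """Return the setuptools-related compile arguments for a depset.

    Hypothetical helper: other default arguments are left out.
    """
    args: List[str] = []
    if not include_setuptools:
        # Default behaviour: keep setuptools out of the compiled depset.
        args += ["--unsafe-package", "setuptools"]
    return args
```

Defaulting the flag to false preserves the existing behaviour for every depset config that does not mention it.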
---------
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
commit 292b977661b1ee9804bc0c6a3d3fbecd2b89ec25
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 20:36:43 2025 -0800
[serve] remove minbuild-serve-py3.9 (#58585)
nothing is using it anymore
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 0cdbe3f24132c69c4d6ce9322f85de767b660135
Author: Ibrahim Rabbani <irabbani@anyscale.com>
Date: Wed Nov 12 18:48:27 2025 -0800
[core] (cgroups) Use /proc/mounts if mount file is missing. (#58577)
Signed-off-by: irabbani <irabbani@anyscale.com>
commit 22fbee343bc5326b2912ee24eb8faa8517ea29ec
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 18:26:25 2025 -0800
[deps] update `requirements_buildkite.txt` (#58574)
as the pydantic version is pinned in `requirements-doc.txt` now.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 7a6e29e96b1fa33ad5ff45e37d6f4da7eadd822a
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 16:38:54 2025 -0800
Revert "[bazel] upgrade bazel python rules to 0.25.0" (#58578)
Reverts ray-project/ray#58535
failing on windows.. :(
commit 2f55d078bb69f39198eccf6293683e17a2e72dc5
Author: Goutam <goutam@anyscale.com>
Date: Wed Nov 12 16:37:24 2025 -0800
[Data] - Iceberg support upsert tables + schema update + overwrite tables (#58270)
- Support upserting iceberg tables for IcebergDatasink
- Update schema on APPEND and UPSERT
- Enable overwriting the entire table
Upgrades to pyiceberg 0.10.0 because it now supports upsert and
overwrite functionality. Also, for append, the library now handles the
transaction logic implicitly, so that burden can be lifted from Ray Data.
---------
Signed-off-by: Goutam <goutam@anyscale.com>
commit d6793ecdbc4e6043cc0b0f19862b4b0c8256bb7f
Author: Joshua Lee <73967497+Sparks0219@users.noreply.github.com>
Date: Wed Nov 12 16:31:26 2025 -0800
[core] Use GetNodeAddressAndLiveness in raylet client pool (#58576)
Using GetNodeAddressAndLiveness in raylet client pool instead of the
bulkier Get, same for AsyncGetAll. Seems like it was already done in
core worker client pool, so just making the same change for raylet
client pool.
Signed-off-by: joshlee <joshlee@anyscale.com>
commit e713b3de319afd437f2de7435f5a2870167fa99a
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 15:01:35 2025 -0800
[doc] set default python env to 3.10 (#58570)
we stop supporting building with python 3.9 now
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 8e4b32e0366a9b32f7dfbd55d5dd5a30fc5c734b
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 15:01:20 2025 -0800
[bazel] rename contraint from hermatic to python_version (#58499)
which is more accurate
also moves python constraint definitions into the `bazel/` directory and
registers the python 3.10 platform with the hermetic toolchain
this allows performing the migration from python 3.9 to python 3.10
incrementally
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 0d56f3ef9ae32c5ce8543bb76d9ccde120140623
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Wed Nov 12 14:23:17 2025 -0800
[images][deps] raydepsets base extra depset (#58461)
generating depsets for base extra python requirements
Installing requirements in base extra image
---------
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
commit df65225e4f98bce2b45405b1cf89fb70556e2871
Author: Daniel Shin <88547237+kyuds@users.noreply.github.com>
Date: Thu Nov 13 07:08:15 2025 +0900
[Data] Use Approximate Quantile for RobustScaler Preprocessor (#58371)
Currently Ray Data has a preprocessor called `RobustScaler`. This scales
the data based on given quantiles. Calculating the quantiles involves
sorting the entire dataset by column for each column (C sorts for C
number of columns), which, for a large dataset, will require a lot of
calculations.
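For context, a plain-Python sketch of exact quantile-based scaling makes the cost visible: every quantile lookup sorts the whole column. The formula used here, (x - median) / (high - low), and the helper names are assumptions for illustration, not a copy of the `RobustScaler` implementation.

```python
from typing import List, Sequence, Tuple


def exact_quantile(values: Sequence[float], q: float) -> float:
    """Exact quantile with linear interpolation; requires a full sort."""
    ordered = sorted(values)  # O(n log n) per call, the expensive step
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def robust_scale(
    values: Sequence[float], quantile_range: Tuple[float, float] = (0.25, 0.75)
) -> List[float]:
    """Scale one column as (x - median) / (high quantile - low quantile)."""
    low = exact_quantile(values, quantile_range[0])
    high = exact_quantile(values, quantile_range[1])
    median = exact_quantile(values, 0.5)
    spread = high - low
    if spread == 0:
        raise ValueError("quantile range has zero width")
    return [(v - median) / spread for v in values]


print(robust_scale([0, 1, 2, 3, 4]))  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

An approximate quantile aggregator replaces the `sorted` call with a mergeable summary, trading a small amount of accuracy for a single pass over the data.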
** MAJOR EDIT **: had to replace the original `tdigest` with `ddsketch`
as I couldn't actually find well-maintained tdigest libraries for
python. ddsketch is better maintained.
** MAJOR EDIT 2 **: discussed offline to use `ApproximateQuantile`
aggregator
---------
Signed-off-by: kyuds <kyuseung1016@gmail.com>
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
commit 5e71d58badbfdcfc002826398c3e02469065cc71
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Thu Nov 13 03:33:18 2025 +0530
[Core] support token auth in ray client server (#58557)
support token auth in ray client server by using the existing grpc
interceptors. This pr refactors the code to:
- add/rename sync and async client and server interceptors
- create grpc utils to house grpc channel and server creation logic,
python codebase is updated to use these methods
- separate tests for sync and async interceptors
- make existing authentication integration tests run in RAY_CLIENT
mode
---------
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit a6cc5499e7fa07c0d6cdc7b7cd0b08dfc08073dd
Author: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
Date: Wed Nov 12 13:45:02 2025 -0800
[Core] Move request id creation to worker to address plasma get perf regression (#58390)
This PR addresses the performance regression introduced in the [PR to make
ray.get thread safe](https://github.com/ray-project/ray/pull/57911).
Specifically, the previous PR requires the worker to block and wait for
AsyncGet to return with a reply of the request id needed for correctly
cleaning up get requests. This additional synchronous step causes the
plasma store Get to regress in performance.
This PR moves the request id generation step to the plasma store,
removing the blocking step to fix the perf regression.
- [PR which introduced perf
regression](https://github.com/ray-project/ray/pull/57911)
- [PR which observed the
regression](https://github.com/ray-project/ray/pull/58175)
New performance of the change measured by `ray microbenchmark`.
<img width="485" height="17" alt="image"
src="https://github.com/user-attachments/assets/b96b9676-3735-4e94-9ade-aaeb7514f4d0"
/>
Original performance prior to the change. Here we focus on the
regressing `single client get calls (Plasma Store)` metric, where our
new performance returns us back to the original 10k per second range
compared to the existing sub 5k per second.
<img width="811" height="355" alt="image"
src="https://github.com/user-attachments/assets/d1fecf82-708e-48c4-9879-34c59a5e056c"
/>
---------
Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
commit 9e450e6805824ac825488e1455ac97f93df0bbc3
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 12:36:21 2025 -0800
[doc] symlink the doc dependency lock file (#58520)
and ask people to use that lock file for building docs.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 16c2f5fffbd1d772606de28ac39c0bb7182efdd4
Author: Lehui Liu <lehui@anyscale.com>
Date: Wed Nov 12 12:08:28 2025 -0800
[train] Set JAX_PLATFORMS env var based on ScalingConfig (#57783)
1. JaxTrainer relies on the runtime env var "JAX_PLATFORMS" being set
to initialize jax.distributed:
https://github.com/ray-project/ray/blob/master/python/ray/train/v2/jax/config.py#L38
2. Before this change, users had to both configure `use_tpu=True`
in `ray.train.ScalingConfig` and pass `JAX_PLATFORMS=tpu` to be able
to start jax.distributed. `JAX_PLATFORMS` can be a comma-separated string.
3. If the user uses other jax.distributed libraries like Orbax, this
sometimes leads to a misleading error about distributed initialization.
4. After this change, if the user sets `use_tpu=True`, we automatically
add this to the env var.
5. A TPU unit test is not available at this time; we will explore how to
cover it later.
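The behaviour in points 2 and 4 can be sketched as a pure function over an environment mapping. The helper name `with_jax_platforms` and the choice to merge `tpu` into an existing comma-separated value are assumptions for illustration; the actual change in Ray Train may handle existing values differently.

```python
from typing import Dict, Mapping


def with_jax_platforms(env: Mapping[str, str], use_tpu: bool) -> Dict[str, str]:
    """Return a copy of env with `tpu` present in JAX_PLATFORMS when use_tpu is set."""
    updated = dict(env)
    if not use_tpu:
        return updated
    # JAX_PLATFORMS may already hold a comma-separated list of platforms.
    current = updated.get("JAX_PLATFORMS", "")
    platforms = [p.strip() for p in current.split(",") if p.strip()]
    if "tpu" not in platforms:
        platforms.append("tpu")
    updated["JAX_PLATFORMS"] = ",".join(platforms)
    return updated
```

Deriving the variable from the scaling config removes the need to keep two settings in sync by hand.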
---------
Signed-off-by: Lehui Liu <lehui@anyscale.com>
commit 1ab16e26a0251d3964637c6fe0f2f9a0ae8c6312
Author: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Date: Wed Nov 12 12:04:16 2025 -0800
[Data] Add `Ranker` Interface (#58513)
Creates a ranker interface that will rank the best operator to run next
in `select_operator_to_run`. This code only refactors the existing
code. The ranking value must be something that is comparable.
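One possible shape for such an interface, as a hedged sketch: the class and function names other than `Ranker` are hypothetical, operators are modelled as plain dicts, and treating the smallest rank as the best is an assumption rather than the documented behaviour.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Sequence

Operator = Dict[str, Any]  # stand-in for a real operator object


class Ranker(ABC):
    """Maps an operator to a comparable ranking value."""

    @abstractmethod
    def rank(self, op: Operator) -> Any:
        ...


class QueueSizeRanker(Ranker):
    """Hypothetical ranker: prefer operators with fewer queued outputs."""

    def rank(self, op: Operator) -> Any:
        # Tuples compare element by element, so the name breaks ties.
        return (op["queued"], op["name"])


def select_best(ops: Sequence[Operator], ranker: Ranker) -> Optional[Operator]:
    """Pick the operator with the smallest rank, or None if there are none."""
    if not ops:
        return None
    return min(ops, key=ranker.rank)
```

Any value that supports ordering works as a rank, which is why the interface only requires the result to be comparable.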
---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
commit 9d5a2416e2980501ffc5c094ce5c59709f93ccf2
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 11:50:42 2025 -0800
[bazel] upgrade bazel python rules to 0.25.0 (#58535)
previously it was actually using 0.4.0, which is set up by the grpc
repo. the declaration in the workspace file was being shadowed..
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 02afe68937429bfd6501e4d0f46780bca4dea329
Author: Balaji Veeramani <balaji@anyscale.com>
Date: Wed Nov 12 11:34:59 2025 -0800
[Data] Refactor concurrency validation tests in `test_map.py` (#58549)
The original `test_concurrency` function combined multiple test
scenarios into a single test with complex control flow and expensive Ray
cluster initialization. This refactoring extracts the parameter
validation tests into focused, independent tests that are faster,
clearer, and easier to maintain.
Additionally, the original test included "validation" cases that tested
valid concurrency parameters but didn't actually verify that concurrency
was being limited correctly—they only checked that the output was
correct, which isn't useful for validating the concurrency feature
itself.
**Key improvements:**
- Split validation tests into `test_invalid_func_concurrency_raises` and
`test_invalid_class_concurrency_raises`
- Use parametrized tests for different invalid concurrency values
- Switch from `shutdown_only` with explicit `ray.init()` to
`ray_start_regular_shared` to eliminate cluster initialization overhead
- Minimize test data from 10 blocks to 1 element since we're only
validating parameter errors
- Remove non-validation tests that didn't verify concurrency behavior
The validation tests now execute significantly faster and provide
clearer failure messages. Each test has a single, well-defined purpose
making maintenance and debugging easier.
---------
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
commit 676b86f4a8d6a4c4eab70f5f381642d9a17fdca2
Author: Balaji Veeramani <balaji@anyscale.com>
Date: Wed Nov 12 11:32:48 2025 -0800
[Data] Convert rST-style to Google-style docstrings in `ray.data` (#58523)
This PR improves documentation consistency in the `python/ray/data`
module by converting all remaining rST-style docstrings (`:param:`,
`:return:`, etc.) to Google-style format (`Args:`, `Returns:`, etc.).
**Files modified:**
- `python/ray/data/preprocessors/utils.py` - Converted
`StatComputationPlan.add_callable_stat()`
- `python/ray/data/preprocessors/encoder.py` - Converted
`unique_post_fn()`
- `python/ray/data/block.py` - Converted `BlockColumnAccessor.hash()`
and `BlockColumnAccessor.is_composed_of_lists()`
- `python/ray/data/_internal/datasource/delta_sharing_datasource.py` -
Converted `DeltaSharingDatasource.setup_delta_sharing_connections()`
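For illustration, the conversion looks roughly like the pair below. `scale_rst` and `scale_google` are made-up functions, not any of the functions listed above; they only contrast the two docstring styles.

```python
def scale_rst(values, factor):
    """Multiply every value by a constant factor (rST-style fields).

    :param values: Numbers to scale.
    :param factor: Multiplier applied to each value.
    :return: A new list with the scaled values.
    """
    return [v * factor for v in values]


def scale_google(values, factor):
    """Multiply every value by a constant factor (Google-style sections).

    Args:
        values: Numbers to scale.
        factor: Multiplier applied to each value.

    Returns:
        A new list with the scaled values.
    """
    return [v * factor for v in values]
```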
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
commit 7e872837e450411e9da45acea0c52f4b67221500
Author: Nikhil G <nrghosh@users.noreply.github.com>
Date: Wed Nov 12 09:07:32 2025 -0800
[serve][llm] Fix ReplicaContext serialization error in DPRankAssigner (#58504)
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
commit cd09d104f6d595a805fd8f9979d9f81a828823b5
Author: Alexey Kudinkin <ak@anyscale.com>
Date: Wed Nov 12 11:50:05 2025 -0500
[Data] Lowering `DEFAULT_ACTOR_MAX_TASKS_IN_FLIGHT_TO_MAX_CONCURRENCY_FACTOR` to 2 (#58262)
This value was originally set to align with the previous default of 4.
However, after some consideration I've realized that 4 is too high,
so this PR lowers it to 2.
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
commit 126a40bc711cf06ed44686ee5026624d6b78766e
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Wed Nov 12 07:44:53 2025 -0800
[core] fix idle node termination on object pulling (#57928)
Currently, a node is considered idle while pulling objects from the
remote object store. This can lead to situations where a node is
terminated as idle, causing the cluster to enter an infinite loop when
pulling large objects that exceed the node idle termination timeout.
This PR fixes the issue by treating object pulling as a busy activity.
Note that nodes can still accept additional tasks while pulling objects
(since pulling consumes no resources), but the auto-scaler will no
longer terminate the node prematurely.
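The rule can be pictured with a small model. This is only a sketch under stated assumptions: `NodeActivity`, its fields, and `should_terminate` are hypothetical names, and the actual idle check in Ray is not shown here.

```python
from dataclasses import dataclass


@dataclass
class NodeActivity:
    """Hypothetical snapshot of what a node is doing."""

    running_tasks: int = 0
    objects_being_pulled: int = 0


def is_idle(activity: NodeActivity) -> bool:
    """A node is idle only if it runs no tasks and pulls no objects."""
    return activity.running_tasks == 0 and activity.objects_being_pulled == 0


def should_terminate(
    activity: NodeActivity, idle_seconds: float, idle_timeout_s: float
) -> bool:
    """Terminate only nodes that have stayed idle past the timeout."""
    return is_idle(activity) and idle_seconds >= idle_timeout_s
```

With this rule, a node that has run no tasks for longer than the timeout but is still pulling a large object is kept alive.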
Closes #54372
Test:
- CI
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit ad8f30291137efce9e463fb23e6821f4c7c74a9c
Author: Sagar Sumit <sagarsumit09@gmail.com>
Date: Wed Nov 12 05:40:47 2025 -0800
[core] Use graceful shutdown path when actor OUT_OF_SCOPE (`del actor`) (#57090)
When actors terminate gracefully, Ray calls the actor's
`__ray_shutdown__()` method if defined, allowing for cleanup of
resources. However, this method is not invoked when the actor goes out
of scope due to `del actor`.
Traced through the entire code path, and here's what happens:
Flow when `del actor` is called:
1. **Python side**: `ActorHandle.__del__()` ->
`worker.core_worker.remove_actor_handle_reference(actor_id)`
https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/python/ray/actor.py#L2040
2. **C++ ref counting**: `CoreWorker::RemoveActorHandleReference()` ->
`reference_counter_->RemoveLocalReference()`
- When ref count reaches 0, triggers `OnObjectOutOfScopeOrFreed`
callback
https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L2503-L2506
3. **Actor manager callback**: `MarkActorKilledOrOutOfScope()` ->
`AsyncReportActorOutOfScope()` to GCS
https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/actor_manager.cc#L180-L183
https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/task_submission/actor_task_submitter.cc#L44-L51
4. **GCS receives notification**: `HandleReportActorOutOfScope()`
- **THE PROBLEM IS HERE** ([line 279 in
`src/ray/gcs/gcs_actor_manager.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/gcs/gcs_actor_manager.cc#L279)):
```cpp
DestroyActor(actor_id,
             GenActorOutOfScopeCause(actor),
             /*force_kill=*/true,  // <-- HARDCODED TO TRUE!
             [reply, send_reply_callback]() {
```
5. **Actor worker receives kill signal**: `HandleKillActor()` in
[`src/ray/core_worker/core_worker.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L3970)
```cpp
if (request.force_kill()) {  // This is TRUE for OUT_OF_SCOPE
  ForceExit(...)  // Skips __ray_shutdown__
} else {
  Exit(...)  // Would call __ray_shutdown__
}
```
6. **ForceExit path**: Bypasses graceful shutdown -> No
`__ray_shutdown__` callback invoked.
This PR simply changes the GCS to use graceful shutdown for OUT_OF_SCOPE
actors. Also, updated the docs.
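The effect of the flag can be modeled in a few lines of Python. This is a toy model of the branch shown in step 5, not Ray code: `handle_kill_actor` and `Resource` are hypothetical, and only the `__ray_shutdown__` hook name comes from the description above.

```python
class Resource:
    """Toy actor that records whether its cleanup hook ran."""

    def __init__(self):
        self.cleaned_up = False

    def __ray_shutdown__(self):
        self.cleaned_up = True


def handle_kill_actor(actor, force_kill: bool) -> str:
    """Mirror the two exit paths: a forced exit skips the cleanup hook."""
    if force_kill:
        return "force_exit"
    hook = getattr(actor, "__ray_shutdown__", None)
    if callable(hook):
        hook()
    return "exit"
```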
---------
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com>
commit 15393edbe72f5079279d3a0e46b72adc7496cdfc
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Wed Nov 12 19:00:10 2025 +0530
[Core] use client interceptor for adding auth token in c++ client calls (#58424)
- Use client interceptor for adding auth tokens in grpc calls when
`AUTH_MODE=token`
- BuildChannel() will automatically include the interceptor
- Removed `auth_token` parameter from `ClientCallImpl`
- Removed manual auth from `python_gcs_subscriber.cc`
- Added tests to verify auth works for autoscaler APIs
---------
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit d496ea87808706333703be6ff25ecc9472330fd5
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Wed Nov 12 11:25:11 2025 +0530
[core] Token auth usability improvements (#58408)
- rename RAY_auth_mode → RAY_AUTH_MODE environment variable across
codebase
- Excluded healthcheck endpoints from authentication for Kubernetes
compatibility
- Fixed dashboard cookie handling to respect auth mode and clear stale
tokens when switching clusters
---------
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 584f5acdf804b1ba097ff7fa5d78a0bfd63c682b
Author: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Date: Tue Nov 11 19:50:52 2025 -0800
[doc][serve][llm] Attached the correct figure to the pd docs (#58543)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
commit a15f5be797ced0df321bfd8d42bab7d57defa2de
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Tue Nov 11 18:00:43 2025 -0800
[doc] downgrade readthedocs to use python 3.10 (#58536)
be consistent with the default build environment
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 9dcb67dc9ff20d9b9ae29875bb610273ba4149ed
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Tue Nov 11 17:26:15 2025 -0800
[core] Fix auth test import (#58554)
The python test step is failing on master now because of this. Probably
a logical merge conflict.
```
FAILED: //python/ray/tests:test_grpc_authentication_server_interceptor (Summary)
...
[2025-11-11T22:11:54Z] from ray.tests.authentication_test_utils import (
--
| [2025-11-11T22:11:54Z] ModuleNotFoundError: No module named 'ray.tests.authentication_test_utils'
```
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 20bf68263beed3609e24aede3d9fc96bc07f0da0
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Tue Nov 11 12:44:05 2025 -0800
[core][rdt] Abort NIXL and allow actor reuse on failed transfers (#56783)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 89a329cd1e0219629132abc203085117a11949f3
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Tue Nov 11 12:26:17 2025 -0800
[core] Improve kill actor logs (#58544)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 6c9607ea57b9edde07c856f094835c84f47b79a6
Author: Nikhil G <nrghosh@users.noreply.github.com>
Date: Tue Nov 11 12:16:41 2025 -0800
[docs][serve][llm] examples and doc for cross-node TP/PP in Serve (#57715)
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>
commit 711d9453828fecebb91b9642e799b4b0b4a493f7
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Tue Nov 11 12:13:13 2025 -0800
[core] Make GlobalState lazy initialization thread-safe (#58182)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit fd10c39829a580bd83ba28c8518e7a7a5ebd3dfb
Author: Kai-Hsun Chen <kaihsun@anyscale.com>
Date: Tue Nov 11 09:43:05 2025 -0800
[core] Scheduling a detached actor with a placement group is not recommended (#57726)
If users schedule a detached actor into a placement group, Raylet will
kill the actor when the placement group is removed. The actor will be
stuck in the `RESTARTING` state forever if it's restartable until users
explicitly kill it.
In that case, if users try to `get_actor` with the actor's name, it can
still return the restarting actor, but no process exists. It will no
longer be restarted because the PG is gone, and no PG with the same ID
will be created during the cluster's lifetime.
The better behavior would be for Ray to transition a task/actor's state
to dead when it is impossible to restart. However, this would add too
much complexity to the core, so I think it's not worth it. Therefore,
this PR adds a warning log, and users should use detached actors or PGs
correctly.
Example: Run the following script and run `ray list actors`.
```python
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
from ray.util.placement_group import placement_group, remove_placement_group
@ray.remote(num_cpus=1, lifetime="detached", max_restarts=-1)
class Actor:
    pass

ray.init()
pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())
actor = Actor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg,
    )
).remote()
ray.get(actor.__ray_ready__.remote())
```
**Type of change:** Enhancement 🚀
**Does this PR introduce breaking changes?** No
**Testing:** Tested the changes manually
**Code Quality:** Signed off every commit (`git commit -s`); ran pre-commit hooks
---------
Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
Signed-off-by: Robert Nishihara <robertnishihara@gmail.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 0752886e7d55694b6cf8d780b7470d58266c6a10
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Tue Nov 11 07:19:19 2025 -0800
[core] enable open telemetry by default (#56432)
This PR enables OpenTelemetry as the default backend for the Ray metrics
stack. The bulk of this PR fixes tests that were written
with assumptions that no longer hold. For ease of review, I have
added inline comments explaining the reason for the change to each
test.
This PR also depends on a release of vllm (so that we can update the
minimal supported version of vllm in ray).
Test:
- CI
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> Enable OpenTelemetry metrics backend by default and refactor
metrics/Serve tests to use timeseries APIs and updated `ray_serve_*`
metric names.
>
> - **Core/Config**:
> - Default-enable OpenTelemetry: set `RAY_enable_open_telemetry` to
`true` in `ray_constants.py` and `ray_config_def.h`.
> - Metrics `Counter`: use `CythonCount` by default; keep legacy
`CythonSum` only when OTEL is explicitly disabled.
> - **Serve/Metrics Tests**:
> - Replace text scraping with `PrometheusTimeseries` and
`fetch_prometheus_metric_timeseries` throughout.
> - Update metric names/tags to `ray_serve_*` and counter suffixes
`*_total`; adjust latency metric names and processing/queued gauges.
> - Reduce ad-hoc HTTP scrapes; plumb a reusable `timeseries` object and
pass through helpers.
> - **General Test Fixes**:
> - Remove OTEL parametrization/fixtures; simplify expectations where
counters-as-gauges no longer apply; drop related tests.
> - Cardinality tests: include `"low"` level and remove OTEL gating;
stop injecting `enable_open_telemetry` in system config.
> - Actor/state/thread tests: migrate to cluster fixtures, wait for
dashboard agent, and adjust expected worker thread counts.
> - **Build**:
> - Remove OTEL-specific Bazel test shard/env overrides; clean OTEL env
from C++ stats test.
<!-- /CURSOR_SUMMARY -->
---------
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit bf595e32d049503f5c1931c5b477647a06d191c2
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Tue Nov 11 19:15:41 2025 +0530
[Core] move authentication_test_utils into ray._private to fix macos tests (#58528)
The auth token test setup in `conftest.py` is breaking the macOS tests. There
are two test scripts (`test_microbenchmarks.py` and `test_basic.py`)
that run after the wheel is installed but without editable mode. For
these tests to pass, `conftest.py` cannot import anything under
`ray.tests`.
This PR moves `authentication_test_utils` into `ray._private` to fix
this issue.
Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
commit 3d29c4ccc9182c44d3cfab08fb561cb7db74eea8
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Tue Nov 11 19:10:56 2025 +0530
[Core] Add Service Interceptor to support token authentication in dashboard agent (#58405)
Add a grpc service interceptor to intercept all dashboard agent rpc
calls and validate the presence of auth token (when auth mode is token)
---------
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 1a48e7318442d038f2c43d22da3b580fa643b8d1
Author: curiosity-hyf <curiooosity.h@gmail.com>
Date: Tue Nov 11 21:35:42 2025 +0800
[Docs] fix pattern_async_actor demo typo (#58486)
fix pattern_async_actor demo typo. Add `self.`.
---------
Signed-off-by: curiosity-hyf <curiooosity.h@gmail.com>
commit f2a7a94a75b007a801ee5a2cf6a6e24b93e9cb9a
Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
Date: Mon Nov 10 18:28:46 2025 -0800
Update pydoclint to version 0.8.1 (#58490)
* Does the work to bump pydoclint up to the latest version
* And allowlist any new violations it finds
---------
Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>
commit 10983e8c9f50ddfa355efe7977d056b29b38d4c1
Author: Goutam <goutam@anyscale.com>
Date: Mon Nov 10 17:34:13 2025 -0800
[Data] - Iceberg support predicate & projection pushdown (#58286)
Predicate pushdown (https://github.com/ray-project/ray/pull/58150) in
conjunction with this PR should speed up reads from Iceberg.
Once the above change lands, we can add the pushdown interface support
for IcebergDatasource
---------
Signed-off-by: Goutam <goutam@anyscale.com>
commit 09f01135f4ab71d52be7a44d06e40ff3767f6cee
Author: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Date: Mon Nov 10 17:28:23 2025 -0800
[serve][llm] Fix import path in muli-node release test (#58498)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
commit 405c4648c2fe71afb7daf4ea574605190f129fd7
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 16:04:48 2025 -0800
[ci] upgrade rayci version (#58514)
to 0.21.0; supports wanda priority now.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 6de012fd0df23993054653ca5517a66944c58dd2
Author: Zac Policzer <zac@anyscale.com>
Date: Mon Nov 10 14:05:15 2025 -0800
[core] Add owned object spill metrics (#57870)
This PR adds 2 new metrics to core_worker by way of the reference
counter. The two new metrics keep track of the count and size of objects
owned by the worker as well as keeping track of their states. States are
defined as:
- **PendingCreation**: An object that is pending creation and hasn't
finished its initialization (and is sizeless)
- **InPlasma**: An object which has an assigned node address and isn't
spilled
- **Spilled**: An object which has an assigned node address and is
spilled
- **InMemory**: An object which has no assigned address but isn't
pending creation (and therefore, must be local)
The approach used by these new metrics is to examine the state 'before
and after' any mutations on the reference in the reference_counter. This
is required in order to do the appropriate bookkeeping (decrementing
some values and incrementing others). Admittedly, a metric read can land
between a decrement and its matching increment, depending on when the
RecordMetrics loop runs. This unfortunate side effect, however, seems
preferable to doing mutual exclusion with metric collection, as this is
potentially a high-throughput code path.
In addition, keeping live counts seemed preferable to doing a full
accounting of the object store and across all references at the time of
metric collection. The reason is that the reference counter may be
tracking millions of objects, so each metric scan could
be very expensive. So keeping running counts (despite being potentially
inaccurate for short periods) seemed the right call.
This PR also allows for object size to change due to
potentially non-deterministic instantiation (say an object is initially
created, but its primary copy dies, and then the recreation fails).
This is an edge case, but seems important for completeness' sake.
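The state rules and the 'before and after' bookkeeping can be sketched as follows. This is a simplified Python model, not the C++ reference counter: `Ref`, `OwnedObjectMetrics`, and their fields are hypothetical names, and only the four state names and their definitions come from the description above.

```python
from collections import Counter
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Ref:
    """Hypothetical, simplified view of one owned object reference."""

    pending_creation: bool = True
    node_address: Optional[str] = None
    spilled: bool = False
    size: int = 0  # treated as 0 while the object is still pending


def state_of(ref: Ref) -> str:
    """Classify a reference into one of the four states listed above."""
    if ref.pending_creation:
        return "PendingCreation"
    if ref.node_address is not None:
        return "Spilled" if ref.spilled else "InPlasma"
    return "InMemory"


class OwnedObjectMetrics:
    """Keeps running counts and byte totals per state."""

    def __init__(self):
        self.count = Counter()
        self.bytes = Counter()

    def add(self, ref: Ref) -> None:
        self.count[state_of(ref)] += 1
        self.bytes[state_of(ref)] += ref.size

    def mutate(self, before: Ref, **changes) -> Ref:
        """Apply a change, moving the object from its old bucket to its new one."""
        after = replace(before, **changes)
        self.count[state_of(before)] -= 1
        self.bytes[state_of(before)] -= before.size
        self.count[state_of(after)] += 1
        self.bytes[state_of(after)] += after.size
        return after
```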
---------
Signed-off-by: zac <zac@anyscale.com>
commit f2dd0e2b6dc7bc074f72197ff08f7d4e58635052
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 14:02:11 2025 -0800
[java] remove local genrule `//java:ray_java_pkg` (#58503)
using `bazelisk run //java:gen_ray_java_pkg` everywhere
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit b23adc777c5b103291cf3a35b51b123a808d36f6
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 14:01:27 2025 -0800
[ci] apply isort to release test directory, part 1 (#58505)
excluding `*_tests` directories for now to reduce the impact
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit ce1fd472b2677069a5bfcd2b5ed7a2695f5f2966
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 14:01:06 2025 -0800
[doc] change link check to run on python 3.12 (#58506)
migrating all doc related things to run on python 3.12
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit b09b076e15fefe842a0b7e33accff71ec3c31435
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 14:00:01 2025 -0800
[doc] ci: move doc annotation check to python 3.12 (#58507)
be consistent with doc build environment
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 8971f83ecb40d54729c2c26d394594c29199e19d
Author: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Date: Mon Nov 10 12:52:43 2025 -0800
[data] Clear queue for manually mark_execution_finished operators (#58441)
Currently, we clear _external_ queues when an operator is manually
marked as finished, but we don't clear its _internal_ queues. This PR
fixes that.
Fixes this test
https://buildkite.com/ray-project/postmerge/builds/14223#019a5791-3d46-4ab8-9f97-e03ea1c04bb0/642-736
---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
commit ffb51f866802ad3858d82a9356855a38503efec9
Author: Matthew Owen <mowen@anyscale.com>
Date: Mon Nov 10 10:54:34 2025 -0800
[data] Update depsets for multimodal inference release tests (#57233)
Update remaining multimodal release tests to use new depsets.
commit 62231dd4ba8e784da8800b248ad7616b8db92de7
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 10:30:00 2025 -0800
[ci] seperate doc related jobs into its own group (#58454)
so that they are not called lints any more
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 3f7a7b42fda0bb75a9af6e5ad197ba3743b011c2
Author: harshit-anyscale <harshit@anyscale.com>
Date: Mon Nov 10 23:45:38 2025 +0530
increase timeout for test_initial_replica tests (#58423)
- The `test_target_capacity` Windows test is failing, possibly because
of the short 10-second timeout. This increases the timeout to check
whether the timeout is the cause.
Signed-off-by: harshit <harshit@anyscale.com>
commit 217031a48f4f83d04950ad39b94846ba362edd37
Author: Jugal Shah <47508441+jugalshah291@users.noreply.github.com>
Date: Mon Nov 10 09:39:43 2025 -0800
Define an env for controlling UVloop (#58442)
Our Serve ingress keeps running into the following `uvloop`-related
error under heavy load:
```
File descriptor 97 is used by transport
```
The uvloop team has an open
[PR](https://github.com/MagicStack/uvloop/pull/646) to fix it, but it
seems that no one is working on it.
One workaround mentioned in the
[PR](https://github.com/MagicStack/uvloop/pull/646#issuecomment-3138886982)
is to turn off uvloop.
We tried that in our environment and didn't see any major performance
difference.
Hence this PR defines a new environment variable for controlling
uvloop.
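A minimal sketch of such a switch, assuming a made-up variable name: `EXAMPLE_USE_UVLOOP`, its "enabled by default" behavior, and both helper functions are hypothetical and are not the names this PR introduces. The sketch falls back to the standard asyncio loop when the switch is off or uvloop is not installed.

```python
import asyncio
import os
from typing import Mapping

_FALSE_VALUES = ("0", "false", "no", "off")


def uvloop_enabled(
    environ: Mapping[str, str] = os.environ, var: str = "EXAMPLE_USE_UVLOOP"
) -> bool:
    """Read the (hypothetical) switch; anything but a 'false' value enables uvloop."""
    return environ.get(var, "1").strip().lower() not in _FALSE_VALUES


def new_event_loop(environ: Mapping[str, str] = os.environ) -> asyncio.AbstractEventLoop:
    """Create a uvloop loop when enabled and installed, else a stock asyncio loop."""
    if uvloop_enabled(environ):
        try:
            import uvloop
        except ImportError:
            pass  # uvloop is optional; fall through to asyncio
        else:
            return uvloop.new_event_loop()
    return asyncio.new_event_loop()
```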
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
commit 2486ddd9fec83cc940937e3d91368942588ef177
Author: fscnick <6858627+fscnick@users.noreply.github.com>
Date: Mon Nov 10 23:29:03 2025 +0800
[Doc][KubeRay] eliminate vale errors (#58429)
Fix some Vale errors and suggestions in the kai-scheduler document.
See https://github.com/ray-project/ray/pull/58161#discussion_r2463701719
Signed-off-by: fscnick <fscnick.dev@gmail.com>
commit cb6a60d0afcfca87734a399291343e297031f1d5
Author: Daniel Sperber <github.blurry@9ox.net>
Date: Mon Nov 10 16:24:34 2025 +0100
[air] Add stacklevel option to deprecation_warning (#58357)
Currently, deprecation warnings are sometimes not informative enough.
When a warning is triggered, it does not tell us *where* the deprecated
feature is used. For example, Ray internally raises a deprecation
warning when an `RLModuleConfig` is initialized.
```python
>>> from ray.rllib.core.rl_module.rl_module import RLModuleConfig
>>> RLModuleConfig()
2025-11-02 18:21:27,318 WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
```
This is confusing: where did *I* use a config, and what am I doing wrong?
This leads to questions like:
https://discuss.ray.io/t/warning-deprecation-py-50-deprecationwarning-rlmodule-config-rlmoduleconfig-object-has-been-deprecated-use-rlmodule-observation-space-action-space-inference-only-model-config-catalog-class-instead/23064
Tracing where the warning actually originates is tedious: is it my code or
internal code? The output just shows `deprecation.py:50`, which is not helpful.
This PR adds a `stacklevel` option, with `stacklevel=2` as the default, to all
`deprecation_warning` calls, so that developers and users can see where the
deprecated option is actually used.
---
EDIT:
**Before**
```python
WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])`
```
**After**, the `module.py:line` where the deprecated artifact is used is
shown in the log output.
When building an Algorithm:
```python
WARNING rl_module.py:445 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
```
```python
.../ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
```
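The mechanism behind this is the `stacklevel` argument of Python's standard `logging` calls (available since Python 3.8). The sketch below is not Ray's `deprecation_warning`; `warn_deprecated`, `user_code`, and the logger setup are hypothetical and only show how the attributed call site moves from the helper to its caller.

```python
import io
import logging


def make_logger(stream: io.StringIO) -> logging.Logger:
    """Logger whose output starts with the function the record is attributed to."""
    logger = logging.getLogger("stacklevel_sketch")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(funcName)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.WARNING)
    return logger


def warn_deprecated(logger: logging.Logger, old: str, new: str, stacklevel: int = 2) -> None:
    """Hypothetical helper: stacklevel=2 attributes the warning to our caller."""
    logger.warning(
        "`%s` has been deprecated. Use `%s` instead.", old, new, stacklevel=stacklevel
    )


def user_code(logger: logging.Logger, stacklevel: int) -> None:
    """Stands in for the code that uses the deprecated feature."""
    warn_deprecated(logger, "old_api", "new_api", stacklevel=stacklevel)


def attributed_function(stacklevel: int) -> str:
    """Return the function name the log record points at."""
    stream = io.StringIO()
    user_code(make_logger(stream), stacklevel)
    return stream.getvalue().split(":", 1)[0]
```

With `stacklevel=1` the record points at the helper itself, which is the uninformative case described above; with `stacklevel=2` it points at `user_code`.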
Signed-off-by: Daraan <github.blurry@9ox.net>
commit 5bff52ab5d9a9d67de88c4f0b86c918487ed7216
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Mon Nov 10 20:50:21 2025 +0530
[core] Configure an interceptor to pass auth token in python direct g… (#58395)
There are places in the Python code where we use the raw grpc library to
make grpc calls (e.g. pub-sub and some calls to GCS). In the long term
we want to fully deprecate grpc library usage in our Python code base,
but as that will take more effort and testing, this PR
introduces an interceptor that adds auth headers (this takes effect
for all grpc calls made from Python).
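The core of such an interceptor is a step that adds the token to each call's metadata before the call is sent. The sketch below shows only that step in plain Python, without the `grpc` library: `with_auth_metadata` is a hypothetical helper, the `authorization: Bearer ...` header format and the error on a missing token are assumptions, and the environment variable names are the ones used in this commit message's shell example.

```python
import os
from typing import List, Mapping, Optional, Tuple

Metadata = List[Tuple[str, str]]


def with_auth_metadata(
    metadata: Optional[Metadata], environ: Mapping[str, str] = os.environ
) -> Metadata:
    """Return a copy of the call metadata, with an auth header in token mode."""
    result = list(metadata or [])
    if environ.get("RAY_auth_mode", "").lower() != "token":
        return result  # auth disabled: leave the metadata untouched
    token = environ.get("RAY_AUTH_TOKEN")
    if not token:
        # Sketch choice: fail loudly rather than send an unauthenticated call.
        raise RuntimeError("token auth is enabled but RAY_AUTH_TOKEN is not set")
    # Assumed header format; the real header name and scheme may differ.
    result.append(("authorization", f"Bearer {token}"))
    return result
```

A real client interceptor would call a step like this from its intercept method and pass the updated metadata on to the continuation.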
```
export RAY_auth_mode="token"
export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
ray start --head
ray job submit -- echo "hi"
```
output
```
ray job submit -- echo "hi"
2025-11-04 06:28:09,122 - INFO - NumExpr defaulting to 4 threads.
Job submission server address: http://127.0.0.1:8265
-------------------------------------------------------
Job 'raysubmit_1EV8q86uKM24nHmH' submitted successfully
-------------------------------------------------------
Next steps
Query the logs of the job:
ray job logs raysubmit_1EV8q86uKM24nHmH
Query the status of the job:
ray job status raysubmit_1EV8q86uKM24nHmH
Request the job to be stopped:
ray job stop raysubmit_1EV8q86uKM24nHmH
Tailing logs until the job exits (disable with --no-wait):
2025-11-04 06:28:10,363 INFO job_manager.py:568 -- Runtime env is setting up.
hi
Running entrypoint for job raysubmit_1EV8q86uKM24nHmH: echo hi
------------------------------------------
Job 'raysubmit_1EV8q86uKM24nHmH' succeeded
------------------------------------------
```
dashboard
test.py
```python
import time
import ray
from ray._raylet import Config
ray.init()
@ray.remote
def print_hi():
    print("Hi")
    time.sleep(2)

@ray.remote
class SimpleActor:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

actor = SimpleActor.remote()
result = ray.get(actor.increment.remote())
for i in range(100):
    ray.get(print_hi.remote())
time.sleep(20)
ray.shutdown()
```
```
export RAY_auth_mode="token"
export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
python test.py
```
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/008829d8-51b6-445a-b135-5f76b6ccf292"
/>
overview page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/cece0da7-0edd-4438-9d60-776526b49762"
/>
job page: tasks are listed
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/b98eb1d9-cacc-45ea-b0e2-07ce8922202a"
/>
task page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/09ff38e1-e151-4e34-8651-d206eb8b5136"
/>
actors page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/10a30b3d-3f7e-4f3d-b669-962056579459"
/>
specific actor page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/ab1915bd-3d1b-4813-8101-a219432a55c0"
/>
---------
Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
commit 71c7bd056cc132c57a4c3cf13d0f5207cbcfd73f
Author: Xinyu Zhang <60529799+xyuzh@users.noreply.github.com>
Date: Sun Nov 9 08:34:46 2025 -0800
[Data] Add exception handling for invalid URIs in download operation (#58464)
commit d74c1570543045a0f99df4d5690ac44f1fda4a55
Author: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Date: Sat Nov 8 15:35:11 2025 -0800
[dashboards][core] Make `do_reply` accept status_code, instead of success: bool (#58384)
Pass in `status_code` directly into `do_reply`. This is a follow up to
https://github.com/ray-project/ray/pull/58255
---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
commit e793631896f65a88513510b4e7bf6f100607cb03
Author: Rueian <rueiancsie@gmail.com>
Date: Sat Nov 8 15:32:10 2025 -0800
[core][autoscaler] Fix RAY_NODE_TYPE_NAME handling when autoscaler is in read-only mode (#58460)
This ensures node type names are correctly reported even when the
autoscaler is disabled (read-only mode).
Autoscaler v2 fails to report prometheus metrics when operating in
read-only mode on KubeRay with the following KeyError error:
```
2025-11-08 12:06:57,402 ERROR autoscaler.py:215 -- 'small-group'
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state
return Reconciler.reconcile(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 120, in reconcile
Reconciler._step_next(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 275, in _step_next
Reconciler._scale_cluster(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1125, in _scale_cluster
reply = scheduler.schedule(sched_request)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 933, in schedule
ResourceDemandScheduler._enforce_max_workers_per_type(ctx)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 1006, in _enforce_max_workers_per_type
node_config = ctx.get_node_type_configs()[node_type]
KeyError: 'small-group'
```
This happens because the `ReadOnlyProviderConfigReader` populates
`ctx.get_node_type_configs()` using node IDs as node types, which is
correct for local Ray (where local ray does not have
`RAY_NODE_TYPE_NAME` set), but incorrect for KubeRay where
`ray_node_type_name` is present and expected wi…
This addresses issue #645 and the issues reported in aiohttp/aiohappyeyeballs#93 and aiohttp/aiohappyeyeballs#112.